Abstract

RNA motifs typically consist of short, modular patterns that include base pairs formed 
within and between modules. Estimating the abundance of these patterns is of fundamental 
importance for assessing the statistical significance of matches in genomewide searches, and for 
predicting whether a given function has evolved many times in different species or arose from a 
single common ancestor. In this manuscript, we review in an integrated and self-contained man- 
ner some basic concepts of automata theory, generating functions and transfer matrix methods 
that are relevant to pattern analysis in biological sequences. We formalize, in a general frame- 
work, the concept of Markov chain embedding to analyze patterns in random strings produced 
by a memoryless source. This conceptualization, together with the capability of automata to 
recognize complicated patterns, allows a systematic analysis of problems related to the occur- 
rence and frequency of patterns in random strings. The applications we present focus on the 
concept of synchronization of automata, as well as automata used to search for a finite number 
of keywords (including sets of patterns generated according to base pairing rules) in a general 
text. 

1 Introduction 

The importance of RNA in biology is increasing as we learn more about the function of RNA 
molecules. Some RNA molecules are passive messengers in translation (a step in the production of 
protein molecules from the DNA genome), but RNA molecules can also act as catalysts [CZG81,
GTGM+83]. Recent estimates suggest that the human genome may encode up to 75,000 small
RNA genes, which is at least three times the number of protein-coding genes [LTL+05]. Because
new functional RNA molecules are being discovered every day, the problem of understanding the 
structure and sequence requirements for RNA function is of increasing importance. 

Functional RNA molecules share important structural and sequence characteristics. These RNA 
molecules typically consist of short, evolutionarily conserved regions (modules) that are separated by 
essentially random spacer sequences that can vary both in length and nucleotide sequence [KY03].
Modules often base pair with each other, an effect which introduces long-range correlations among 
parts of the sequence. (For more detailed definitions of patterns and modules, see below.) 

If a particular motif corresponds to a functional RNA molecule, the corresponding modular 
pattern may be statistically over- or underrepresented in the genome. This assumption is used in 
genomewide searches for possible functional RNA molecules. The estimation of over- or underrepre- 
sentation requires us to calculate the probability that the modular pattern occurs in some statistical 
model of the genome sequence. Therefore, the study of RNA sequences is directly related to pattern 
matching and the probability of occurrence of patterns in random strings. 

Traditionally, sequence similarity between genetic sequences in different organisms has been 
interpreted to mean that the genes in both organisms share a common ancestor. This assumption
underlies many sequence analysis algorithms. However, increasing evidence suggests that sequence 
similarity may not always imply common descent of RNA molecules. This may occur because, 
despite the diversity of functional RNA molecules, some functions can only be evolved in a relatively 
small number of ways. For example, the hammerhead ribozyme, a self-cleaving RNA that has 
an evolutionarily conserved catalytic core of only 11 nucleotides, has both been observed in a wide 
range of organisms and has also been artificially selected from random-sequence backgrounds [TB00,
SAS01]. Similarly, artificial selection of RNAs from random sequences has recaptured the sequence
of the catalytic core of the ribosome [WMY97, YW00] and features of the genetic code [YCK05].
Therefore, study of RNA molecules may require new models of sequence evolution that characterize 
the origins of a motif from a random sequence. 

Although probabilistic models for sequence evolution from a common ancestor are well-established 
[Kim81, Fel81], probabilistic models for independent origins of an RNA motif in random-sequence
backgrounds have been less well studied [KY03, KDSM+05]. The long-range correlations introduced
by base pairing can be difficult to accommodate in search algorithms. Paired RNA motifs cannot 
be represented as regular languages, but instead must be represented as context-free grammars 
for full generality [ED94, RE00]. Genomewide searches have been performed for several functional
RNA motifs [BFP+99, FBP+00, KE03, GJMM+05]. In these searches, the statistical significance
of matches has typically been assessed using Monte Carlo simulations, in which the search is re- 
peated using randomized versions of the search text. This procedure has significant limitations: it is 
time-consuming and cannot accurately estimate the low p-values that are important for computing 
likelihood ratios for rare events. Therefore, new methods for computing the probability of sequences
in random strings are important for determining the statistical significance of RNA motif searches.

Previous work has focused on patterns related to RNA structure. However, recent work has 
developed other pattern-matching problems related to RNA sequences. For example, multiple short 
protein- or RNA-binding sequence motifs can combine to regulate a range of biological processes, 
including splicing and polyadenylation Sli Kl)(i . Similarly, 6-base seed sequences that bind short 
microRNA molecules (miRNAs) appear to work in concert to repress translation [LBB05]. Current
evidence suggests that while the motifs function combinatorially (several motifs must be present 
together for biological function), no results suggest that specific base pairing between the modules 
is required for function. Therefore, for sequence analysis the motifs can be treated as uncorrelated 
(although the sequences within each module can be compound). 

RNA and other biological sequences are not intrinsically random. However, computational bi- 
ologists model these sequences as random (using different models) to assess which patterns within 
the sequence are likely to be biologically significant. Therefore the modeling of RNA as a random 
sequence is a mathematical construction rather than a biophysical model of RNA. 

Here we review a broad selection of approaches to estimating the expected number of matches 
to, or the probability of occurrence of, RNA motifs. We consider motifs both with and without 
correlations (such as base pairing). These approaches draw from many branches of mathematics 
and computer science. Progress in this field has been limited by the difficulty of integrating results 
from different fields that use different concepts and terminology (see section [2] for a glossary of 
terms). Terminology and previous work are reviewed in section [3]. In section [4] we give mathematical
definitions and proofs of key concepts in deterministic pattern matching. These concepts include the 
use of automata to search for keywords in databases, and synchronization of automata to search for 
multiple patterns simultaneously. We also give independent proofs that the Aho-Corasick automaton 
matches compound patterns, even in cases in which keywords are subpatterns of other keywords. In 
section [5] we formalize the Markov chain embedding technique for probabilistic pattern matching.

Sections [6] and [7] describe examples where the methods are applied to sequence analysis problems 
relevant to RNA motif searches. The examples we give here, which rely on memoryless sources, 
demonstrate how automata theory is a fundamental tool to analyze the occurrence of patterns in 
random strings. We also provide references that extend these examples to Markovian models, and 
to other, more complex, models. Our examples rely on the matrix representation of probabilities 
and generating functions extracted from the graphs associated with the automata. As expected, 
the generating functions are rational functions (i.e., ratios of polynomials). This feature allows the 
use of well-known techniques to analyze their asymptotic behavior for long random strings, such as 
those encountered in genomewide analysis. We will focus primarily on sooner-times (i.e., the first 
occurrence of any item from a list of patterns in a random sequence) and count statistics (i.e., the 
number of occurrences of a pattern in a random sequence) . 

2 Glossary of terms



alphabet: a finite set of characters used to build a text. Example: for RNA sequences, the alphabet 
contains the four nucleotides A, C, G, and U. 

autocorrelation polynomial: a polynomial in one variable that quantifies the degree to which a 
word overlaps with itself. 

automaton: see deterministic finite automaton; also called state machine. 

Bernoulli source: a model for the generation of a random string in which the probability of a 
given character is fixed, independently of the characters appearing elsewhere in the string; also 
called a memoryless model. 

character: an element of an alphabet; also called letter or symbol. 

compound pattern: any finite set of strings, usually but not always consisting of words with a 
common or similar structure; also called a degenerate pattern. Example: AAC{U,T}CCG 
is a compound pattern of two 7-letter strings where the fourth letter in either string can be 
either U or T.

correlation: a dependency between two positions in a pattern, such as that introduced by modeling 
base pairing in RNA or DNA. Example: a correlation would exist if positions 3 and 7 in a 
pattern must base pair with each other; these positions can be filled by any letters as long as 
they form a base pair. 

De Bruijn graph: an automaton, the states of which track the last k characters read in a text. 

deterministic finite automaton: an abstract representation of a (regular pattern) search algo- 
rithm, with a finite set of states and rules that specify transitions between states. It is usually 
represented as a graph composed of a finite number of states (the nodes), transitions between 
states (the edges) and actions performed upon entering or leaving a state (conveyed by labels 
on the edges); also called finite-state automaton or finite-state machine. For keywords match- 
ing, there is an initial state representing the state before any characters have been matched, 
and final states representing matches with the keywords. 

dynamic source: a particular type of probabilistic model for the generation of a random sequence, 
for which the probability of a character may depend on all the preceding characters. Bernoulli 
and Markovian sources are particular instances of dynamic sources. 

edit distance: the distance between two strings, as calculated by summing the cost of each ele- 
mentary operation (e.g., character insertion, deletion or substitution) required to convert one 
string into the other. 

generalized word: a compound pattern for which all words in the pattern have the same length. 

generating function: a function in one or more variables for which the Taylor series coefficients of
the function correspond to probabilities or expected values associated with a discrete random 
variable or vector of interest. 

hidden pattern: a pattern which may appear separated into blocks rather than as a single and 
continuous block within the text. Example: the pattern AGA appears four times in the string
ACAGCCUGA as a hidden pattern. 

keyword: see string. 

language: see pattern. 

letter: see character. 

Markov chain: a sequence of random variables (here taking values in an alphabet) for which the 
probability of the value taken by a random variable is determined by the values of the k 
previous variables. The parameter k > is a finite constant that corresponds to the Markov 
order of the chain. 

Markovian source: a model for the generation of a random sequence for which the probability of
a character depends on the k preceding characters, where k > is the Markov order of the 
sequence. 

memoryless source: see Bernoulli source.

module: see modular pattern.

modular pattern: an ordered list of simple or compound patterns that may include correlations 
within or between patterns. Example: in the context of the DNA alphabet, AA12...2'1'GT is
a pattern with two correlated modules, namely AA12 and 2'1'GT, where 1, 2 and 1', 2' denote
correlations between the first and second module. Here A' = T, T' = A, G' = C and C' = G.

non-overlap counting: the total number of substrings of a text that match a pattern, as the text 
is read from left to right, where a given substring can only be considered once for a match; 
also called renewal counting. 

overlap counting: the total number of substrings of a text that match a pattern. 

pattern: a set of strings. The strings in the pattern usually (but not necessarily) are similar to each 
other. These include simple patterns, modular patterns, correlated modular patterns, and any 
set of words specified by a regular expression; also called language. 

prefix: a substring which corresponds to the start of a string. Every word is a prefix of itself. 
Example: A, AA, AAC, and AACU are prefixes of the string AACUCCG. 

reduced pattern: a pattern where no string is a substring of another string in the pattern. 

renewal counting: see non-overlap counting. 

regular expression: a string that describes all and only those strings belonging to a regular lan- 
guage. 

regular language: see regular pattern. 

regular pattern: a set of strings that can be recognized by a deterministic finite automaton; also 
called regular language. 

run: a maximal sequence of identical characters in a text. Example: the binary string 0010011110 
has three runs of zeros, namely 00, 00 and 0, and two runs of ones, 1 and 1111, respectively. 

state machine: see deterministic finite automaton; also called automaton. 

string: a specific sequence of alphabet characters; also called keyword or word. 

substring: a consecutive list of characters within a string; also called sub-word. 

sub-word: see substring. 

symbol: see character. 

sooner-time: the time required before an event is observed; here corresponds to the length of text 
that precedes the first occurrence of a pattern. 

simple pattern: a pattern where each position in the string is exactly specified by one letter. 
Example: AACUCCG is a 7-letter simple pattern. 

suffix: a substring which corresponds to the end of another string. Every word is a suffix of itself. 
Example: G, CG, CCG, and UCCG are suffixes of the string AACUCCG. 

text: a usually long sequence of characters in which patterns may occur. May be randomly gener- 
ated according to a probabilistic model. 

transition matrix: the matrix that summarizes the probability that a Markov chain undergoes a
transition from one state to another. 

transfer matrix: a matrix with polynomial entries, here used to keep track of the number of visits 
that a Markov chain makes to a certain set of states. 

word: see string. 
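
The hidden-pattern example in the glossary can be checked with a short dynamic program. The following Python sketch is illustrative only; the function name count_hidden is hypothetical and does not come from the works cited in this review.

```python
def count_hidden(text, pattern):
    """Count the occurrences of `pattern` in `text` as a hidden pattern,
    i.e. the number of ways to pick positions i1 < i2 < ... in `text`
    whose characters spell out `pattern`."""
    # ways[j] = number of ways to match the first j pattern characters so far
    ways = [1] + [0] * len(pattern)
    for c in text:
        # go right to left so that each text character is used at most once
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == c:
                ways[j] += ways[j - 1]
    return ways[-1]


print(count_hidden("ACAGCCUGA", "AGA"))  # 4, as in the glossary example
```

The table is updated from right to left so that a single text character never extends the same partial match twice.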

3 Prior work on pattern matching 

In this section we summarize much of the work that uses automata and Markov chains to study
patterns in random strings. For an introduction to automata theory and regular expressions see [HU79,
Sip96]. See [Wat95, RRS05] for an introduction to pattern analysis of biological sequences. A
comprehensive discussion of patterns in random strings can be found in the book of Lothaire et
al. [LRD+05]. Other references give useful background on the mathematical techniques discussed in
this paper. An introductory treatment of generating function methods can be found in [Wil94]. See
[FS06] for a broader discussion on generating function and transfer matrix methods. Supplementary
references on Markov chains include [Bre98, Dur99]; for a more detailed discussion of Markov chains
the reader is referred to [Fel68, Dur04].

3.1 Terminology 

3.1.1 Deterministic versus probabilistic pattern matching.

Early work in computer science focused on deterministic pattern matching, where the text to be 
analyzed is given (for example, the abstract of this paper) and one wishes to search the text for 
a given pattern. The number of occurrences of the pattern has a definite answer once the text is 
given. For applications of pattern matching to problems in biological sequence data, one is typically 
interested in probabilistic versions of the pattern matching problem. Therefore this review focuses on 
research in probabilistic pattern matching. Here one models the biological sequence as a random 
string produced according to some model. Typically, the sequence is assumed to be produced by 
a memoryless source or a Markovian source, although hidden Markov models are sometimes used. 
The pattern matching problem can then be formulated as a probabilistic question; different papers 
answer slightly different questions. Previous work can be categorized as problems involving (i) 
counting (what is the probability that a given pattern occurs m times in a random string of length 
n?), (ii) occurrence (what is the probability that a given pattern occurs or does not occur in a random 
string of length n?), (iii) type of occurrence (what is the probability that a string in a pattern is the 
first one observed?), and (iv) distance between occurrences (what is the typical distance between 
successive occurrences of a pattern in a random string of length n?). Note that the question of
occurrence probability (ii) is a special case of the counting problem (i). 

3.1.2 Type of pattern. 

Research in this field has considered a range of different kinds of pattern. The most basic case is a 
simple pattern. A simple pattern is a string where each position in the string is exactly specified 
by one letter; the word dog would be an example of a simple pattern. A compound pattern is 
a finite set of simple patterns; for example, a keyword search for the words dog, cat, and snake 
would seek to match a compound pattern. Compound patterns are sometimes specified by letting 
some positions in the string be chosen from a range of characters. For example, the words snake 
and snare could be represented by the compound pattern sna{k,r}e. A pattern is referred to as 
forbidden if the pattern matching problem seeks to exclude occurrences of the pattern, rather than 
find occurrences of it. 

A correlated pattern contains positions where characters must be related by some rule. For 
example, one could search for the correlated pattern 1o1 with a rule that positions marked by the
number 1 must be the same letter. A search for such a correlated pattern would find all 3-letter 
words where the first and third letters are the same, such as mom and tot. 

A modular pattern is composed of subpatterns which must appear in a certain order, but 
which could be separated by one or more characters. For example, a search for the modules cat...dog
would match examples in the text where the word cat occurs, followed by any number of characters,
followed by the word dog. The number of characters allowed between the modules can be unbounded, 
bounded, or specified uniquely. For instance, cat##...dog would match examples where cat occurs, 
followed by at least two characters followed by dog. A modular pattern could also contain correlations 
within or between modules. A modular pattern may include an infinite number of simple patterns. 

A regular pattern is a pattern that can be described by a regular expression of the type used 
in computer science. 
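
The pattern types described in this section can be illustrated with regular expressions. The Python sketch below is an assumption-laden illustration: the constant names and the helper matches are hypothetical, and the back-reference used for the correlated pattern is a convenience of Python's re module rather than part of the formal definition of a regular expression (over a finite alphabet the correlated pattern 1o1 is still a finite set of words).

```python
import re

# Hypothetical encodings of the pattern types discussed above.
SIMPLE = r"dog"              # simple pattern: one word
COMPOUND = r"sna[kr]e"       # compound pattern sna{k,r}e: snake or snare
CORRELATED = r"(.)o\1"       # correlated pattern 1o1: first and third letters agree
MODULAR = r"cat.{2,}dog"     # modular pattern cat##...dog: at least two characters between modules


def matches(regex, word):
    """Return True if the whole of `word` belongs to the pattern `regex`."""
    return re.fullmatch(regex, word) is not None


print(matches(CORRELATED, "mom"), matches(CORRELATED, "dog"))    # True False
print(matches(MODULAR, "cat--dog"), matches(MODULAR, "cat-dog"))  # True False
```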

3.1.3 Overlaps. 

When matching more complicated patterns, one must specify how to deal with overlaps of words. 
The overlapping count of a pattern in a text corresponds to the number of substrings of the 
text that belong to the set of words specified by the pattern. For example, there are 4 overlapping 
occurrences of TATA and 5 overlapping occurrences of ATA in the text ATATATATATA; therefore, 
there are 9 overlapping occurrences of the compound pattern {TATA, ATA} in this text. There are
only 3 overlapping occurrences of the modular pattern TA#...TATA in the text ATATATATATA. 
To determine a non-overlapping count, one reads the text from left to right. Every time a 
match with the pattern is encountered, the matched word and all characters to its left are removed 
before continuing the count. For instance, there are 2 non-overlapping occurrences of TATA, 3 
non-overlapping occurrences of ATA, and 1 non-overlapping occurrence of TA#...TATA in the text
ATATATATATA. 
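
The two counting conventions can be reproduced with a brute-force Python sketch. It is meant only to make the definitions concrete: the function names are hypothetical, patterns are written as Python regular expressions (so TA#...TATA becomes TA.+TATA), and the quadratic scan over substrings is an illustration, not an efficient search algorithm.

```python
import re


def overlap_count(text, regex):
    """Number of substrings text[b:e] that belong to the pattern."""
    pat = re.compile(regex)
    n = len(text)
    return sum(
        1
        for b in range(n)
        for e in range(b + 1, n + 1)
        if pat.fullmatch(text, b, e)
    )


def nonoverlap_count(text, regex):
    """Read the text left to right; after each match, discard the matched
    word and everything to its left before continuing the count."""
    pat = re.compile(regex)
    count, start = 0, 0
    for end in range(1, len(text) + 1):
        if any(pat.fullmatch(text, b, end) for b in range(start, end)):
            count += 1
            start = end  # characters up to `end` are no longer available
    return count


text = "ATATATATATA"
for regex in ("TATA", "ATA", "TATA|ATA", "TA.+TATA"):
    print(regex, overlap_count(text, regex), nonoverlap_count(text, regex))
```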

3.2 Automata, Probability, and Counting 

Important early work in pattern matching was done by Aho and Corasick [AC75], who constructed
an automaton (now known as the Aho-Corasick automaton) to search for a finite set of keywords 
in a text. Their work was focused on bibliographic search and was therefore deterministic. The 
Aho-Corasick automaton is an example of a deterministic finite automaton (DFA) that we describe 
in detail in section [4.5].

We can imagine an Aho-Corasick automaton that recognizes the word abba. Such an automaton 
would contain five states, numbered 1 (the empty string), 2 (a), 3 (ab), 4 (abb), and 5 (abba). A 
text is processed by the automaton one letter at a time, from left to right. The automaton stays in 
state 1 until the presence of an a in the text triggers a transition to state 2. If the next letter in 
the text is b, the automaton would then move to state 3; otherwise the automaton would remain in 
state 2. (See figure [1] for a sketch of the transition rules for this automaton.) If the automaton is in
state 4 and the next letter encountered is a, then the automaton would transition to state 5 which 
is associated with the detection of the keyword abba. 
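
The automaton for abba described above can be written out explicitly. The Python sketch below assumes the binary alphabet {a, b} and uses hypothetical function names; it builds the transition rule by taking, after each letter, the longest suffix of the text read so far that is also a prefix of the keyword, which is the standard rule for a single-keyword matching automaton.

```python
def build_dfa(word, alphabet):
    """States are numbered 1..len(word)+1 as in the text: state i+1 means that
    the longest suffix of the text read so far that is a prefix of `word`
    has length i."""
    delta = {}
    for i in range(len(word) + 1):
        for c in alphabet:
            s = word[:i] + c
            # drop leading characters until s is a prefix of word
            while not word.startswith(s):
                s = s[1:]
            delta[(i + 1, c)] = len(s) + 1
    return delta


def run(delta, text, start=1):
    """Return the list of states visited while reading `text`."""
    state, visited = start, []
    for c in text:
        state = delta[(state, c)]
        visited.append(state)
    return visited


delta = build_dfa("abba", "ab")
print(run(delta, "abba"))  # [2, 3, 4, 5]
```

Each visit to state 5 corresponds to one (possibly overlapping) occurrence of abba in the text.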

In automata used for pattern matching, prefixes of a word that are also suffixes of the word 
play an important role in determining the structure of the automaton. This structure is used in 
mathematical techniques to count the number of words that contain or forbid certain patterns. 
The autocorrelation of a string and, more generally, the correlation between two strings,
introduced by Guibas and Odlyzko, quantify this idea [GO81b]. The autocorrelation of a string x
is a string of 0's and 1's, of the same length as x, which gives information about the matches
of x with itself. The autocorrelation of x is denoted Aut[x], and a 1 occurs at position n in Aut[x]
if and only if x has a prefix of length n which is also a suffix of x. For instance, if x = abbab
then Aut[x] = 01001. The autocorrelation polynomial of a string x, denoted Aut[x; z], is the
polynomial in the variable z obtained by summing up all the monomials of the form z^{n-1} for which
the n-th character of Aut[x] is a 1. For instance, if x = abbab then Aut[x; z] = z + z^4. If x is a string
constructed with characters in an alphabet of size s and f(n) is the number of strings of length n
that do not have any occurrence of x as a substring then



\[
\sum_{n=0}^{\infty} f(n)\, z^{-n} = \frac{z \cdot \mathrm{Aut}[x; z]}{1 + (z - s) \cdot \mathrm{Aut}[x; z]}
\]

See [GO81b] for generalizations of the above identity to more than one forbidden string.
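
The autocorrelation and the counting identity can be checked numerically on small cases. The Python sketch below is an illustration under stated assumptions: it takes the identity in the Guibas-Odlyzko form, with generating function z Aut[x; z] / (1 + (z - s) Aut[x; z]) in powers of 1/z, rewrites it in the variable w = 1/z, and compares the resulting coefficients with a brute-force enumeration. All function names are hypothetical.

```python
from itertools import product


def autocorrelation(x):
    """Bit n (1-based) is 1 iff the prefix of length n of x is also a suffix."""
    return "".join("1" if x[:n] == x[-n:] else "0" for n in range(1, len(x) + 1))


def avoiding_counts_bruteforce(x, alphabet, n_max):
    """f(n) for n = 0..n_max by enumerating all strings of length n."""
    return [
        sum(1 for t in product(alphabet, repeat=n) if x not in "".join(t))
        for n in range(n_max + 1)
    ]


def avoiding_counts_series(x, s, n_max):
    """f(n) for n = 0..n_max from the assumed generating function.
    With w = 1/z and k = len(x), multiplying numerator and denominator by w^k
    gives C(w) / (w^k + (1 - s*w) * C(w)), where C(w) is the sum of w^(k-n)
    over the positions n at which Aut[x] has a 1."""
    k = len(x)
    C = [0] * (n_max + 1)
    for n, bit in enumerate(autocorrelation(x), start=1):
        if bit == "1" and k - n <= n_max:
            C[k - n] = 1
    D = [0] * (n_max + 2)
    for i, c in enumerate(C):
        D[i] += c
        D[i + 1] -= s * c
    if k <= n_max:
        D[k] += 1
    # power-series division C / D; D[0] == 1 because Aut[x] always ends in 1
    f = []
    for n in range(n_max + 1):
        f.append(C[n] - sum(D[j] * f[n - j] for j in range(1, n + 1)))
    return f


print(autocorrelation("abbab"))  # 01001
print(avoiding_counts_series("abbab", 2, 8) == avoiding_counts_bruteforce("abbab", "ab", 8))
```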

Further discussion of automata theory is found in references [HU79, Sip96, CR02]. Combinatorial
theory of pattern matching is discussed in references [GJ04, FS06]. A thorough discussion of
autocorrelation polynomials can be found in [LRD+05].

3.3 Probabilistic counting 

The use of automata for pattern matching can be extended to consider probabilities of occurrences 
of strings in random texts through the use of probabilistic automata. We can think of a probabilistic 
automaton as a DFA that scans a random text as the text is generated. Transitions between different 
states of the automaton occur according to probabilistic rules, which are determined from the model 
which generates the random text (typically a Markov chain). The probability that a word occurs 
in a random text of a certain length can be determined from the probability that the probabilistic 
automaton visits a specified set of states within a certain number of steps. This translates the pattern 
matching problem into a problem about the behavior of a Markov chain. This correspondence is 
helpful because the theory of Markov chains is a well-established area of probability theory. 
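
As a concrete illustration of this translation, the Python sketch below computes the probability that a single keyword occurs in a random text of length n produced by a memoryless source. It is a minimal sketch under stated assumptions: states are numbered from 0 (unlike the numbering from 1 used for abba earlier), the final state is made absorbing so that its probability mass equals the occurrence probability, and all function names are hypothetical.

```python
def build_dfa(word, alphabet):
    """State i: the longest suffix of the text read so far that is a prefix
    of `word` has length i (0 <= i <= len(word))."""
    delta = {}
    for i in range(len(word) + 1):
        for c in alphabet:
            s = word[:i] + c
            while not word.startswith(s):
                s = s[1:]
            delta[(i, c)] = len(s)
    return delta


def occurrence_probability(word, probs, n):
    """P(word occurs at least once in a random text of length n), where
    `probs` maps each character to its probability (memoryless source)."""
    k = len(word)
    delta = build_dfa(word, list(probs))
    dist = [0.0] * (k + 1)  # distribution of the embedded Markov chain
    dist[0] = 1.0
    for _ in range(n):
        new = [0.0] * (k + 1)
        new[k] = dist[k]  # final state is absorbing
        for state in range(k):
            for c, p in probs.items():
                new[delta[(state, c)]] += dist[state] * p
        dist = new
    return dist[k]


uniform = {"a": 0.5, "b": 0.5}
print(occurrence_probability("abba", uniform, 4))  # 0.0625
print(occurrence_probability("abba", uniform, 5))  # 0.125
```

One step of the loop is one multiplication of the state distribution by the transition matrix of the embedded chain.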

The occurrence of regular patterns in random strings produced by Markov chains (and more 
generally Hidden Markov chains) reduces to problems regarding the behavior of a first-order homo- 
geneous Markov chain in the state space of a suitable DFA. This transformation of the problem is 
often called an embedding technique. As we discuss in sections [5], [6] and [7], the embedding
technique provides a unifying theoretical framework for many different problems in probabilistic pattern
matching. 

The Markov chain embedding technique usually corresponds to the embedding of a random 
string into the states of an Aho-Corasick automaton. In this framework, more complicated patterns 
(such as modular correlated patterns) can be treated through the synchronization of Aho-Corasick 
automata associated with each possible combination of correlations. A synchronized automaton 
or product automaton is a new automaton made up of multiple automata which simultaneously 
process a single text. The technique of synchronization is discussed in detail in section [4.4].
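
A minimal Python sketch of synchronization, assuming one single-keyword automaton per word: the state of the product automaton is the tuple of the component states, and every component reads the same character at each step. The function names are hypothetical, and this sketch does not attempt the construction for correlated modular patterns.

```python
def build_dfa(word, alphabet):
    """State i: the longest suffix of the text read so far that is a prefix
    of `word` has length i."""
    delta = {}
    for i in range(len(word) + 1):
        for c in alphabet:
            s = word[:i] + c
            while not word.startswith(s):
                s = s[1:]
            delta[(i, c)] = len(s)
    return delta


def run_synchronized(words, alphabet, text):
    """Process `text` once with the product of the automata for `words`.
    Returns (end position, word) for every occurrence, positions 1-based."""
    deltas = [build_dfa(w, alphabet) for w in words]
    states = tuple(0 for _ in words)  # initial state of the product automaton
    hits = []
    for pos, c in enumerate(text, start=1):
        # one transition of the product automaton
        states = tuple(d[(s, c)] for d, s in zip(deltas, states))
        for w, s in zip(words, states):
            if s == len(w):
                hits.append((pos, w))
    return hits


hits = run_synchronized(["TATA", "ATA"], "AT", "ATATATATATA")
print(len(hits))  # 9
```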

Important early work in probabilistic pattern matching was done by Li, who studied the first 
occurrence of a reduced compound pattern [Li80]. A compound pattern is reduced if no word in the
pattern is a substring of another word in the pattern. This work focused on a random text produced
by a memoryless source. Follow-up work by Gerber and Li studied the probability of occurrence of
a reduced compound pattern in a random string produced by a memoryless source [GL81]. Their
approach is based on martingale methods (of the type introduced in [Li80]) and Markov chain
embedding techniques which implicitly use automata and synchronization. The martingale method
developed by Gerber and Li has been extended by Pozdnyakov and Kulldorff [PK06] in the setting
of reduced compound patterns and memoryless sources but without the use of the Markov chain 
embedding technique. 

Perhaps the most general computational treatment of pattern frequencies in random sequences 
was carried out by Nicodeme, Salvy, and Flajolet [NSF02]. They considered random strings produced
either by a Bernoulli or a Markovian model and focused on regular patterns which are of a
non-degenerate form. A regular pattern is non-degenerate if the DFA that recognizes the pattern is
irreducible (i.e., from any state it is possible to visit any other state) and primitive (i.e., there is a
minimal length l such that for any pair of states in the automaton there exists a path of length
l that connects the two states). Their analysis is based on automata theory and transfer matrix
methods. They obtained the generating function associated with the distribution of the number of
occurrences of a regular pattern in a random text. From the generating functions they computed the
mean and standard deviation of the Gaussian distribution associated with the number of occurrences 
of the pattern in sufficiently long sequences. When such an approach is applied to biological sequence 
analysis, it allows the determination of z-scores associated with different patterns, and this allows 
researchers to assess the significance of matches. 
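
For short texts the same quantities can be computed exactly rather than asymptotically. The Python sketch below is not the generating-function method of [NSF02]; it is a simpler substitute that tracks the joint distribution of (automaton state, number of occurrences so far) for a single keyword under a memoryless source, and then forms a z-score from the exact mean and standard deviation. All function names are hypothetical.

```python
import math


def build_dfa(word, alphabet):
    """State i: the longest suffix of the text read so far that is a prefix
    of `word` has length i."""
    delta = {}
    for i in range(len(word) + 1):
        for c in alphabet:
            s = word[:i] + c
            while not word.startswith(s):
                s = s[1:]
            delta[(i, c)] = len(s)
    return delta


def count_distribution(word, probs, n):
    """pmf[m] = P(word has exactly m overlapping occurrences in a random
    text of length n); `probs` maps characters to probabilities."""
    k = len(word)
    delta = build_dfa(word, list(probs))
    max_count = max(n - k + 1, 0)
    # dist[state][m] = P(chain is in `state` and m occurrences seen so far)
    dist = [[0.0] * (max_count + 1) for _ in range(k + 1)]
    dist[0][0] = 1.0
    for _ in range(n):
        new = [[0.0] * (max_count + 1) for _ in range(k + 1)]
        for state in range(k + 1):
            for m, mass in enumerate(dist[state]):
                if mass == 0.0:
                    continue
                for c, p in probs.items():
                    nxt = delta[(state, c)]
                    # entering the final state completes one occurrence
                    new[nxt][m + (1 if nxt == k else 0)] += mass * p
        dist = new
    return [sum(dist[state][m] for state in range(k + 1)) for m in range(max_count + 1)]


def z_score(observed, word, probs, n):
    """Standardized count; assumes n >= len(word) so the variance is positive."""
    pmf = count_distribution(word, probs, n)
    mean = sum(m * p for m, p in enumerate(pmf))
    var = sum((m - mean) ** 2 * p for m, p in enumerate(pmf))
    return (observed - mean) / math.sqrt(var)


pmf = count_distribution("abba", {"a": 0.5, "b": 0.5}, 10)
print(sum(m * p for m, p in enumerate(pmf)))  # expected count, (10 - 4 + 1) / 16
```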

Follow-up work by Nicodeme used automata theory and generating functions as the basis of a 
symbolic package called Regexpcount [Nic03]. This software can be used to study the distribution of
the number of occurrences of various regular expressions in Bernoulli or Markovian sources, including 
simultaneous counts of different motifs. The software can also perform searches for strings at a given 
edit distance from a compound pattern and compute the sooner-time of a string, given a random 
string with a prescribed prefix. The implementation of the automata used in this package relies on 
the concept of marked automata [NSF02] and synchronization ideas. (We will not extensively discuss
marked automata, but they can be used as an alternative to synchronization.)

Another important reference in probabilistic pattern matching is the book by Fu and Lou [FL03].
This work compiles and extends results of J. C. Fu and coauthors on the Markov chain embedding
technique [FK94, FC02, FC03]. Although automata are not explicitly used in these papers, their
embedding technique is effectively an implementation of the Aho-Corasick automaton. Their
technique is applicable to the occurrence or frequency of some compound, possibly modular patterns;
however, it cannot handle arbitrary regular patterns. Because their calculation technique typically
requires a large number of states, it has limited computational feasibility.

Regnier and coauthors have made important contributions to probabilistic pattern matching
[RS98, Reg00, RLM00, RD04]. Regnier and Szpankowski studied overlap counting of a simple
pattern; their work considered a text generated by a first-order, stationary Markov chain [RS98]. Their
approach can be used to calculate generating functions using a technique which relies on
combinatorial relationships between certain languages (sets of words) built from the pattern. They obtained
relatively explicit forms for the generating functions, in which the autocorrelation polynomial of the 
pattern being studied appears naturally. As a result, they could extract the asymptotic behavior of 
the coefficients that lead to central and large deviation approximations for the distribution of the 
frequency statistic of the pattern. 

In later work, Regnier generalized to k-th order stationary Markov sequences, compound
patterns, and either overlap or non-overlap counting [Reg00]. The paper gives insight into an aggregation
procedure of the words in a compound pattern that considerably reduces the complexity of the
problem. She defined minimal languages associated with patterns, which contain no
redundancies. (This concept is distinct from the idea of a minimal automaton.) Regnier showed that the
generating functions associated with the minimal languages are determined by the generating
functions associated with some simpler auxiliary languages; this allows an important simplification
of the calculations. The computation of expectations, variances and correlations for the number 
of occurrences of the different words in the compound pattern can be expressed explicitly in terms 
of these generating functions. Her method is more computationally efficient than some other ap- 
proaches that use automata to perform the same calculations, provided that the random string is 
produced by a stationary Markov source. 

Two papers by Regnier and coauthors studied the over- and underrepresentation of patterns. 
Regnier, Lifanov, and Makeev focused on compound patterns that are invariant under the reverse- 
complement operation; they were studying the counting of binding sites in double-stranded DNA 
[RLMOO] . This paper calculates z-scores to assess the over- or underrepresentation of patterns in 
random sequences. More recently, Regnier and Denise examined how over- or underrepresentation of 
a pattern can depend on the over- or underrepresentation of a second pattern (because information 
about the frequency of the second pattern modifies the distribution of the first pattern) [RD04].
In this paper, they studied the asymptotic fraction of times that a single pattern is found in a 
random string produced by a memoryless source or a stationary Markov source of order k. The 
result is a large-deviation principle with an explicit rate function and accompanying second-order 
local expansion. The asymptotic expectation and standard deviation of a pattern conditioning on 
the observed sequence of another pattern were determined in a computable way. 

Aston and Martin studied the probability that any of a set of compound patterns is the first 
to be completed a certain number of times [AM05]. They studied binary strings produced by a
Markovian source. Their method is based on a Markov chain embedding technique and allows the 
possibility that the count may be different for different compound patterns. 

Flajolet, Szpankowski and Vallee studied the total number of occurrences of a hidden pattern 
in a random text generated by a memoryless source [FSV06]. A hidden pattern appears in a
string if all the characters in the pattern appear in order in the string, although other arbitrary 
characters may appear between the characters in the pattern. For example, the text adenosine 
guanine contains the hidden pattern dog because the letters d, o, and g appear in order in the text. 
In this paper, Flajolet and coauthors derive central limit theorems for the number of occurrences 
of the hidden pattern, using a technique based on generating function methods. However, for what 
they call the fully constrained case (i.e., when the gaps between letters in the hidden pattern are 
constrained to be less than specified finite constants) they utilize De Bruijn graphs and transfer 
matrix methods to obtain more refined results regarding the asymptotic distribution of the frequency 
statistic. Recently, Bourdon and Vallee have extended the analysis of the asymptotic behavior of 
the expected value and variance for the number of matches of a hidden pattern in a text generated by
a dynamic source [BV02]. A dynamic source is a generalization of a Markovian source, where
the probability of a character may depend on all the preceding characters [Val01], [CFV01]. Since
Bernoulli and Markovian sources are special cases of dynamic sources, use of dynamic source models
is the most general theoretical framework to study patterns in random strings. Bourdon and Vallee
also showed that the frequency statistic associated with a regular pattern in a random text produced
by a dynamical source is asymptotically Gaussian [BV06].
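
To make the notion of a hidden pattern concrete, the following minimal Python sketch counts the occurrences of a hidden pattern in a fixed text when the gaps between letters are unconstrained. The function name `count_hidden` is hypothetical and the dynamic-programming scheme is a standard subsequence count, not the method of [FSV06].

```python
def count_hidden(text: str, pattern: str) -> int:
    """Count the ways `pattern` appears in order (not necessarily contiguously) in `text`."""
    # counts[j] = number of ways to match the first j pattern characters so far.
    counts = [1] + [0] * len(pattern)
    for ch in text:
        # Go right to left so each text character extends a match at most once.
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == ch:
                counts[j] += counts[j - 1]
    return counts[-1]


occurrences = count_hidden("adenosine guanine", "dog")
```

Here `occurrences` is the frequency statistic for a single deterministic text; the results cited above concern its distribution when the text is random.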



3.3.1 Forbidden patterns. 

Early work by Guibas and Odlyzko studied forbidden patterns [GO81b]. This paper addressed the
probability that a reduced compound pattern does not appear in a random string produced by a 
memoryless source. They introduced the concept of autocorrelation polynomial to find the generating 
function associated with this probability. In probabilistic pattern matching, the autocorrelation
polynomial of a string is also constructed from its autocorrelation (see section 3.2) but taking into
account the probabilities associated with the alphabet characters. See section 3 in [GO81b] for more
details.
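
As an illustration, the sketch below computes the autocorrelation of a string and the coefficients of a probability-weighted autocorrelation polynomial. It assumes the usual Guibas-Odlyzko convention (coefficient i is nonzero when the string overlaps itself after a shift of i positions, weighted by the probability of the i trailing characters); the function names are hypothetical and the convention of section 3.2 may differ in indexing.

```python
from fractions import Fraction


def autocorrelation(w: str) -> list:
    """c[i] = 1 if w shifted right by i positions agrees with w on the overlap."""
    n = len(w)
    return [int(w[i:] == w[:n - i]) for i in range(n)]


def weighted_autocorrelation(w: str, prob: dict) -> list:
    """Coefficients c[i] * Prob(last i characters of w) of the weighted polynomial."""
    n = len(w)
    coeffs = []
    for i, c in enumerate(autocorrelation(w)):
        weight = Fraction(1)
        for ch in w[n - i:]:  # the i characters added by the shifted copy
            weight *= prob[ch]
        coeffs.append(c * weight)
    return coeffs


# Assumed memoryless source with equiprobable characters.
uniform = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
coeffs_abba = weighted_autocorrelation("abba", uniform)
```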

Gani and Irle also studied forbidden patterns [GI99]. They determined the probability that a
string of a given length does not contain a type of compound pattern. The patterns they considered 
must be specified either by a completely repetitive system or a system with a distinctive beginning
(see their paper for precise definitions). Their approach is primarily computational and based on 
matrix recursion methods. Their method is applicable to a memoryless source or a single forbidden 
string in a text produced by a Markovian source. In the case of a Markovian source, they constructed 
an automaton which is similar to the Aho-Corasick automaton. 



3.3.2 Generalized words. 

Bender and Kochman studied the number of occurrences of generalized words [BK93]. They define a
generalized word as a set of strings of the same length. They focused on a memoryless source and 
obtained central and local limit theorems for the joint distribution of the number of occurrences of 
generalized words given that a forbidden generalized word does not occur within the random string; 
they were able to obtain explicit formulae only when there are no forbidden generalized words. The 
use of de Bruijn automata and transfer matrices is implicit in their argument. 



3.4 Distance between pattern occurrences 

These papers address the question of the separation between patterns in the text; typically they are 
interested in computing the probability that a pattern first occurs after l characters of the text or
the probability that two patterns are separated by m characters. 

The sooner-time of a pattern is the number of characters that precede the first occurrence of 
the pattern. Li calculated the expected value of the sooner-time of a reduced compound pattern 
[Li80]. This work focused on a random string produced by a memoryless source. This approach
is based on martingale techniques and also includes calculation of the probability that any of the 
strings in the compound pattern is first to occur. 

Early work on sooner-times was motivated by the digestion of DNA by restriction enzymes. In 
this experimental protocol, specific enzymes recognize particular DNA sequences, called restriction 
sites; the enzymes cut the DNA at the restriction sites. Typically the restriction site can be described 
by a compound pattern, and one is only interested in non-overlapping occurrences. This is justified 
because enzymes cut the strand at the first position where a string in the compound pattern is 
identified. Breen, Waterman, and Zhang found the generating functions for this problem, assuming 
a random string produced by a memoryless source [BWZ85]. Their analysis is based on renewal
theory arguments [Fel68] and autocorrelation polynomials similar to those used in [GO81b]. Biggins
and Cannings addressed the more general problem of Markovian sources [BC87].

Robin and Daudin determined the exact distribution of (and generating functions associated 
with) the distance between two consecutive (possibly overlapping) occurrences of a reduced
compound pattern [RD01]. They considered a random string produced by a first-order homogeneous
Markov chain. Their analysis is related to autocorrelation polynomials; the technique is applied to 
analyze the CHI-motif in the genome sequence of Haemophilus influenzae.

Han and Hirano studied the distributions of sooner- and later-time for two reduced patterns in a
random string produced by a first-order Markov chain [HH03]. The later-time of two patterns is the
number of characters that precede the completion of both patterns. Their paper uses probabilistic
arguments to determine the generating functions associated with the sooner- and later-time; their
approach is related to the concept of autocorrelation [GO78], [GO81a], [GO81b]. They also study other
statistics such as the distance between two successive occurrences of the reduced patterns. Their 
argument can be adapted to study the sooner-time of a reduced compound pattern. 

Work by Park and Spouge studied the sooner-time and the distance between occurrences for the 
more general case of a random text produced by an irreducible, aperiodic stationary Markov chain 
[PS04]. This approach used a Markov chain embedding technique (and implicitly the Aho-Corasick
automaton). They obtained in closed form the generating function associated with the sooner-time 
and with the statistic of distances between two consecutive occurrences of a reduced compound 
pattern. 

3.5 Related techniques 

3.5.1 Sequence alignment and seed sensitivity. 

Buhler, Keich, and Sun used techniques from automata theory and Markov chains to determine 
optimal seeds for sequence alignment [BKS03]. Seeds are short strings which are used as starting
points in sequence alignment algorithms to reduce the computation time. The approach of Buhler 
et al. allows the design of seeds that are optimal (with respect to a specified Markov model). They
used the concept of a similarity, which is used to quantify the matches between sequences in an 
alignment. Their technique is based on a Markov chain embedding argument over the state space 
of an appropriate Aho-Corasick automaton. 

Martin studied the distribution of the total number of successes (1s) in success runs (sequences
of 1s) longer than a predetermined length in a binary sequence (sequence of 0s and 1s) produced by
a Markov source [Mar05]. This work used a Markov chain embedding technique. It is applicable to
the detection of tandem repeats in DNA sequences: in this case a 1 corresponds to a match between
two aligned DNA sequences and a 0 to a mismatch. The distribution of the number of successes
is needed in the detection phase of Benson's Tandem Repeats Finder algorithm [Ben99] to validate
candidate sequences via hypothesis testing.

Kucherov, Noe, and Roytberg used automata theory to address the general problem of determining
seed sensitivity [KNR06]. In this paper, Kucherov et al. permit the set of allowed seeds
and target alignments to be described by a DFA and allow the probabilistic model of the target 
alignments to be described by a Hidden Markov model (rather than a finite-order Markov chain). 
Their technique relies on a synchronization argument that involves two DFAs and the HMM. They 
also define a new automaton to specify the seed model that, according to simulation data, performs 
2-3 orders of magnitude better than the Aho-Corasick automaton. 

3.5.2 Random number generators. 

Work by Flajolet, Kirschenhofer, and Tichy studied the distribution of substrings in binary strings 
[FKT88]. Although the motivation for this work is the performance of random number generators,
the techniques used overlap with the techniques of pattern analysis in random strings. They showed 
that almost all binary strings of length $n$ contain all possible binary strings of length slightly less
than $\log_2(n)$ a nearly uniform number of times. Their analysis is based on De Bruijn graphs,
autocorrelation polynomials, and generating functions.



4 Languages, automata, and synchronization 



In this section, we introduce mathematical notation and definitions to describe regular languages
(section 4.2), automata (section 4.3), and synchronization (section 4.4). We conclude with a discussion
of Aho-Corasick automata (section 4.5). This section gives a self-contained presentation of the
key mathematical results and proofs for automata used in deterministic pattern matching.

4.1 Main notation



The alphabet A is a finite non-empty set; the elements in A are characters used to construct strings. 
A string over A is a finite sequence of characters in A. We use lowercase letters (such as x) to denote 
generic strings. The length of a string x, denoted |x|, is the total number of characters (counting 
all repetitions) in the string. The empty string, denoted $\epsilon$, is by definition the only string of length
zero. We assume that $\epsilon \notin A$, that is, the alphabet does not contain the empty string.

The set $A^*$ is defined to contain the empty string as well as all strings formed with characters
in $A$. A basic operation between two strings is concatenation: if $x, y \in A^*$ then $xy$ is defined to be
the string formed by concatenating $y$ after $x$. Since, by definition, $x\epsilon = x$ and $\epsilon x = x$, in general
$|xy| = |x| + |y|$.

For $x \in A^*$ and $1 \le i \le j \le |x|$, $x[i..j]$ denotes the substring of $x$ formed by all characters
between and including the $i$-th and $j$-th character of $x$. We write $x[i]$ as a shorthand for $x[i..i]$. Note
that for $x, y \in A^*$, we write $xy[i..j]$ to refer to the string formed by concatenating $x$ with $y[i..j]$ as
opposed to $(xy)[i..j]$, which refers to a substring of $xy$.

For $x, y \in A^*$ we write $x = ...y$ to mean that there exists $z \in A^*$ (possibly empty) such that
$x = zy$. In this case we say that $y$ is a suffix of $x$. Similarly, we write $x = y...$ to mean that there
exists $z \in A^*$ such that $x = yz$ and we say that $y$ is a prefix of $x$.

4.2 Regular Languages 

A language over $A$ is any subset of $A^*$; we typically use the letter $L$ to denote a language, so
$L \subseteq A^*$. We write $|L|$ to refer to the cardinality of $L$, i.e., the number of strings contained
in $L$. For example, $|A|$ is the number of alphabet characters. This is not to be confused with the
length of a string: for $x \in A^*$, $|x|$ refers to the length of $x$; however, $|\{x\}| = 1$ regardless of the
length of $x$ because $\{x\}$ is a language consisting of a single string.

Three standard operations, union, concatenation and star, are usually defined over languages. 
For $L_1, L_2 \subseteq A^*$, the union $(L_1 \cup L_2)$ corresponds to the usual union of two sets, i.e., a string
$x \in (L_1 \cup L_2)$ if and only if $x \in L_1$ or $x \in L_2$. The concatenation language $L_1 L_2$ consists of all
those strings of the form $xy$, with $x \in L_1$ and $y \in L_2$. Finally, $L_1^*$ is the language formed by the
empty string and by any string that can be formed by concatenating a finite number of strings in
$L_1$. Mathematically, $L_1^* = \{\epsilon\} \cup L_1 \cup L_1 L_1 \cup L_1 L_1 L_1 \cup \ldots$

The class of regular languages is the smallest class of subsets of A* that contains all finite 
languages (i.e., languages consisting of a finite number of strings) and that is closed under the three 
standard operations. 

4.3 Deterministic Finite Automata 

A deterministic finite automaton (DFA) is a 5-tuple of the form $G = (V, A, f, q, T)$, where $V$ is a
nonempty set, $A$ is an alphabet, $f : V \times A \to V$ is a function, $q \in V$ and $T \subseteq V$. The terms $V$, $f$, $q$
and $T$ are called, respectively, the set of states, transition function, initial state and set of terminal
states.

In what follows $G = (V, A, f, q, T)$ is a given DFA. $G$ can be represented as a graph with vertex
set $V$ where a directed edge labeled with the character $a$ goes from a vertex $u$ to a vertex $v$ if and
only if $f(u, a) = v$. In particular, each vertex has out-degree $|A|$ and for all $u \in V$ and $a \in A$
there exists a unique edge labeled with the character $a$ that starts at $u$. See figures 1, 2 and 3 for
examples of automata represented as directed labeled graphs.

The visual representation of $G$ facilitates the extension of the transition function $f$ to the larger
domain $V \times A^*$ as follows. For $x \in A^*$ define the path associated with $x$ in $G$ when starting at $u$
to be the sequence of states that are visited from $u$ by following the edges in $G$ according to the
labels appearing in $x$ as they are read from left to right. In the special case that $u = q$ (i.e., the
path begins at the initial state), we refer to this path as the path associated with $x$ in $G$. We define
$f(u, x)$ to be the state in $V$ where the path associated with $x$ ends when starting at $u$. Note that
$f(u, \epsilon) = u$. As a result, $f : V \times A^* \to V$ satisfies the following fundamental property: for all $u \in V$
and $x, y \in A^*$,

$$f(u, xy) = f(f(u, x), y). \qquad (1)$$

In other words, the path associated with the concatenation of two strings can be determined by
concatenating the paths associated with each string, provided that the end of the first path is used 
as the starting point of the second path. 

For $u, v \in V$, we say that $v$ is accessible from $u$ if there exists $x \in A^*$ such that $f(u, x) = v$.

The language recognized by G is defined as 

$$L(G) := \{x \in A^* : f(q, x) \in T\}.$$

In other words, L(G) consists of all strings that can be formed by concatenating from left to right 
the labels of the edges visited by any path that starts at the initial state of G and ends at some 
terminal state. 
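
The definitions above translate almost directly into code. The following is a minimal sketch, assuming a DFA is stored as a dictionary from (state, character) pairs to states; the names `run` and `accepts` and the example automaton are hypothetical.

```python
from typing import Dict, Set, Tuple

Transition = Dict[Tuple[str, str], str]


def run(f: Transition, u: str, x: str) -> str:
    """Extended transition f(u, x): follow the edges labeled by x, left to right."""
    for a in x:
        u = f[(u, a)]
    return u


def accepts(f: Transition, q: str, T: Set[str], x: str) -> bool:
    """x belongs to L(G) exactly when the path associated with x ends in T."""
    return run(f, q, x) in T


# Hypothetical example: strings over {a, b} containing an even number of b's.
f_even = {
    ("even", "a"): "even", ("even", "b"): "odd",
    ("odd", "a"): "odd", ("odd", "b"): "even",
}
q_even, T_even = "even", {"even"}
```

Because `run` reads one character at a time, property (1) holds by construction: running `x + y` from `u` gives the same state as running `y` from `run(f, u, x)`.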

In what follows we say that a language L is recognized by G provided that L = L(G). According 
to two classical results in computer science, Kleene's theorem and the Rabin and Scott theorem, the 
following holds [HU79], [Sip96].

Theorem 4.1 Let $L \subseteq A^*$. $L$ is a regular language if and only if there exists a DFA $G$ such that
L(G) = L. 

Consider two DFAs $G_1 = (V_1, A, f_1, q_1, T_1)$ and $G_2 = (V_2, A, f_2, q_2, T_2)$. We say that $G_1$ is
isomorphic to $G_2$ (denoted $G_1 \sim G_2$) provided that there is a bijection $\Phi : V_1 \to V_2$ such that
$\Phi(q_1) = q_2$, $\Phi(T_1) = T_2$, and for all $u, v \in V_1$ and $a \in A$, $f_1(u, a) = v$ if and only if
$f_2(\Phi(u), a) = \Phi(v)$. We can think of the function $\Phi$ informally as a relabeling of the states of $G_1$
that produces the states of $G_2$. Using (1), one can see that $G_1 \sim G_2$ implies that for all $u, v \in V_1$
and $x \in A^*$

$$f_1(u, x) = v \iff f_2(\Phi(u), x) = \Phi(v).$$

In particular, since $\Phi$ preserves initial states, the path associated with $x$ in $G_1$ ends at $u$ if and only
if the path associated with $x$ in $G_2$ ends at $\Phi(u)$. Since $\Phi$ also preserves terminal states, $G_1$ and $G_2$
recognize the same language. Therefore, isomorphic automata recognize the same regular languages.

4.4 Synchronization 

In what follows, for a given language $L$, $L^c$ denotes the complement of $L$, i.e., $L^c := \{x \in A^* : x \notin L\}$.

Synchronization is an operation between two or more automata that can be used to construct 
a new automaton that has useful properties, such as recognizing multiple languages. To define 
synchronization, consider a finite sequence of regular languages $L_i$, $i = 1, \ldots, m$, with $m \ge 2$. For
each $i$ let $G_i = (V_i, A, f_i, q_i, T_i)$ be a DFA that recognizes $L_i$.

Definition The synchronized automaton associated with $G_1, \ldots, G_m$ is the automaton
$G_1 \times \cdots \times G_m = (V, A, f, q, T)$ with $V := V_1 \times \cdots \times V_m$, $q := (q_1, \ldots, q_m)$ and
$T := \{(u_1, \ldots, u_m) \in V : u_i \in T_i \text{ for at least one } i\}$. The transition function $f : V \times A \to V$ is defined as

$$f(u, a) := (f_1(u_1, a), \ldots, f_m(u_m, a)), \qquad (2)$$

for all $u = (u_1, \ldots, u_m) \in V$ and $a \in A$. To each $u = (u_1, \ldots, u_m) \in V$ we associate the language

$$L(u) := \Big( \bigcap_{i : u_i \in T_i} L_i \Big) \cap \Big( \bigcap_{i : u_i \notin T_i} L_i^c \Big).$$

Synchronized automata are also called product automata. We can informally understand the idea of
synchronization by imagining an automaton which works by simultaneously operating the automata
$G_1, \ldots, G_m$: from the states $u_1, \ldots, u_m$ in the individual automata, we feed each automaton the
character $a$. Then the transitions of the synchronized automaton are determined by combining all
the transitions of the individual automata (which is what definition (2) conveys). See figure 4 for
an example of a synchronized automaton.
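
The construction can be sketched in a few lines of Python. The sketch below assumes each DFA is given as a hypothetical tuple (V, f, q, T) over a shared alphabet, with f a dictionary keyed by (state, character); the two component automata are illustrative inputs only.

```python
from itertools import product


def run(f, u, x):
    """Extended transition function: follow the edges labeled by x."""
    for a in x:
        u = f[(u, a)]
    return u


def synchronize(dfas, alphabet):
    """Product automaton of DFAs given as (V, f, q, T) tuples over one alphabet."""
    V = set(product(*(g[0] for g in dfas)))
    # Definition (2): every component reads the same character.
    f = {
        (u, a): tuple(g[1][(ui, a)] for g, ui in zip(dfas, u))
        for u in V
        for a in alphabet
    }
    q = tuple(g[2] for g in dfas)
    # A product state is terminal if at least one component is terminal.
    T = {u for u in V if any(ui in g[3] for g, ui in zip(dfas, u))}
    return V, f, q, T


# Hypothetical inputs: G1 recognizes strings ending in "a",
# G2 recognizes strings with an even number of "b".
G1 = ({0, 1}, {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}, 0, {1})
G2 = ({"even", "odd"},
      {("even", "a"): "even", ("even", "b"): "odd",
       ("odd", "a"): "odd", ("odd", "b"): "even"},
      "even", {"even"})
V, f, q, T = synchronize([G1, G2], "ab")
```

With this choice of `T` the product recognizes the union of the two languages; replacing `any` by another predicate on the components gives intersections or complements instead.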

The key feature of synchronized automata is revealed by the following result. 

Theorem 4.2 If $G_1 \times \cdots \times G_m = (V, A, f, q, T)$ then for all $u = (u_1, \ldots, u_m) \in V$ and $x \in A^*$,
$f(u, x) = (f_1(u_1, x), \ldots, f_m(u_m, x))$. In particular, for all $x \in A^*$, $x \in L(f(q, x))$.

Proof Fix $u = (u_1, \ldots, u_m) \in V$. We show the first part by induction on the length of $x$. Since the
case $|x| = 0$ is trivial, it suffices to show that if the identity holds for all strings of length $n$ then it
also holds for an $x \in A^*$ of length $(n + 1)$. Indeed, according to (1), the inductive hypothesis and
the definition of $f$, we have that

$$\begin{aligned}
f(u, x) &= f(f(u, x[1..n]), x[n+1]) \\
&= f((f_1(u_1, x[1..n]), \ldots, f_m(u_m, x[1..n])), x[n+1]) \\
&= (f_1(u_1, x), \ldots, f_m(u_m, x)),
\end{aligned}$$

where we have used that $f_i(f_i(u_i, x[1..n]), x[n+1]) = f_i(u_i, x)$ in the last identity. This proves the
first part of the theorem.

For the second part, let $x \in A^*$. According to the first part, $f(q, x) = (f_1(q_1, x), \ldots, f_m(q_m, x))$.
Since $G_i$ recognizes $L_i$, $x \in L_i$ if and only if $f_i(q_i, x) \in T_i$. Consequently,
$x \in \bigcap_{i : f_i(q_i, x) \in T_i} L_i$ and $x \in \bigcap_{i : f_i(q_i, x) \notin T_i} L_i^c$. This completes the proof of the theorem. □

The first part of the theorem states that the path associated with $x$ in the synchronized
automaton is determined by the paths associated with $x$ in each of the individual automata. A direct
consequence of this is that $f(q, x) \in T$ if and only if there exists $i$ such that $f_i(q_i, x) \in T_i$. In other
words, the synchronized automaton reaches a terminal state if and only if one (or more) of the
individual automata reaches a terminal state. Since this is equivalent to having $x \in L_i$ for some $i$,
we see that $G_1 \times \cdots \times G_m$ recognizes the union language $\bigcup_{i=1}^m L_i$.

The second part of the theorem asserts that the state where the path associated with a string
ends indicates all the languages $L_1, \ldots, L_m$ to which that string belongs. This makes it possible to
redefine the set of terminal states to recognize any language obtained via the intersections, unions and
complementations of the languages $L_1, \ldots, L_m$. For instance, if we were to redefine $T$ as

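$$\Big\{ u \in V : L(u) = L_1 \cap \Big( \bigcup_{i=2}^m L_i \Big)^c, \text{ or } L(u) = \bigcap_{i=1}^m L_i \Big\}$$

then the resulting automaton would precisely recognize the language

$$\Big[ L_1 \cap \Big( \bigcup_{i=2}^m L_i \Big)^c \Big] \cup \Big[ \bigcap_{i=1}^m L_i \Big].$$
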
This feature of product automata is the key property used by computer scientists to show that the 
class of regular languages is the same as the class of languages recognized by DFAs (see [HU79], [Sip96]
for more details). In pattern analysis in random sequences, this property is important for studying 
patterns that include but also exclude certain features. 



4.5 Aho-Corasick automata 

This class of automata was defined by Aho and Corasick [AC75] to detect all the occurrences of a
finite number of keywords in a general text. Aho-Corasick automata can be considered to be finite
state machine implementations of the Knuth-Morris-Pratt string searching algorithm [KJP77].

Definition Let $W \subseteq A^*$ be a finite non-empty set. The automaton $AC(W) = (V, A, f, q, T)$ is
defined as follows. $V$ consists of the empty string as well as all prefixes of strings in $W$, $q := \epsilon$ and
$T := W$. The transition function $f : V \times A \to V$ is defined such that for $u, v \in V$ and $a \in A$,

$$f(u, a) = v \iff v \text{ is the longest element in } V \text{ such that } ua = ...v.$$

The main idea in the definition of the transition function $f$ is the longest-prefix suffix rule: each
state $u \in V$ contains information about the longest prefix of a word in $W$ that is at the same time a
suffix of the text scanned so far by the automaton. See figures 1 and 2 respectively for a representation
of $AC(\{abba\})$ and $AC(\{ba, abba\})$ as directed labeled graphs.
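
The definition can be turned into a short, deliberately naive construction. The sketch below applies the longest-prefix suffix rule literally; the function name `aho_corasick` is hypothetical, and this is not the linear-time construction of [AC75], only an illustration of the transition rule.

```python
def aho_corasick(W, alphabet):
    """Build AC(W) as (V, f, q, T) directly from the longest-prefix suffix rule."""
    # States: the empty string and every prefix of every keyword.
    V = {w[:i] for w in W for i in range(len(w) + 1)}

    def longest_suffix_state(s):
        # Try suffixes of s from longest to shortest; "" is always in V.
        for k in range(len(s) + 1):
            if s[k:] in V:
                return s[k:]

    f = {(u, a): longest_suffix_state(u + a) for u in V for a in alphabet}
    return V, f, "", set(W)


V, f, q, T = aho_corasick({"abba"}, "ab")
```

For the single keyword abba over the alphabet {a, b}, this yields the five states of figure 1; for instance `f[("abba", "b")]` is `"ab"`, the longest state that is a suffix of abbab.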

The technique used in [AC75] to show the correctness of Aho-Corasick automata relies on the 
concept of non-deterministic finite automata. Here we present a new proof that is self-contained 
and relies only on first principles. The following result can be considered a rephrasing of Lemma 1 
in [AC75].

[Figure 1 diagram: states $\epsilon$, a, ab, abb, abba, shown in three panels (top, middle, bottom).]

Figure 1: The Aho-Corasick automaton AC({abba}) that recognizes the language {a, b}* {abba}. 
Top, the full Aho-Corasick automaton that finds all occurrences of abba in a text constructed using 
the binary alphabet {a, b}. The initial state is the empty string (left). The terminal state is abba, 
which corresponds to detection of the string abba in the text (right). Middle, the transitions that 
occur when the character a occurs in the text. Bottom, the transitions that occur when the character 
b occurs in the text. 



Lemma 4.3 For all $x \in A^*$, $f(q, x) = u$ if and only if $u$ is the longest state in $V$ such that $x = ...u$.

Proof We show the lemma by induction on the length of $x$. Since the case $|x| = 0$ is trivial, it suffices
to show that if $|x| = (n + 1)$ and the lemma applies to all strings of length $n$ then it also applies
for $x$. Let $u$ be the longest string in $V$ such that $x = ...u$. Let $v = f(q, x[1..n])$ and $w = f(q, x)$.
According to the inductive hypothesis, $v$ is the longest string in $V$ such that $x[1..n] = ...v$. To prove
the lemma it is enough to show that $u = w$. In order to do so we first show that

$$|w| \le |u| \le |v| + 1. \qquad (3)$$

For this observe that according to (1),

$$f(v, x[n+1]) = w. \qquad (4)$$

In particular, since $x = x[1..n]x[n+1] = ...vx[n+1]$, it follows from the above identity that $x = ...w$.
The defining property of $u$ implies the first inequality in (3). To show the second inequality, we
proceed by contradiction. Suppose, counterfactually, that $|u| > |v| + 1$. Since $x = ...u$ and
$x = ...vx[n+1]$, there would be a nonempty string $y$ such that $u = yvx[n+1]$. In particular, since $u \in V$,
$yv$ must be a prefix of a string in $W$. Hence, $yv \in V$. This is not possible because $x[1..n] = ...yv$
and therefore $v$ could not be the longest element in $V$ with the property that $x[1..n] = ...v$. This
contradicts the defining property of $v$ and therefore the second inequality in (3) must be true.

Finally, we show that $u = w$. Since $x = ...u = ...vx[n+1]$, the second inequality in (3) implies
that $vx[n+1] = ...u$. Using (4), this implies that $|u| \le |w|$ and therefore, according to the first
inequality in (3), $|u| = |w|$. Since $x = ...u = ...w$ then $u = w$. This completes the proof of the
lemma. □

A direct consequence of the above lemma is that the Aho-Corasick automaton AC(W) recognizes 
the language A*W. However, its terminal states satisfy an important property that is useful for 
counting occurrences of patterns in random strings. The theorem describing this property can be 
considered a rephrasing of Lemmas 2 and 3 in [AC75].

Theorem 4.4 For $w \in W$ define $T(w) := \{u \in W : u = ...w\}$. For all $w \in W$ and $x \in A^*$, $w$
occurs $m$ times as a substring of $x$ if and only if the path associated with $x$ in $AC(W)$ visits the set
$T(w)$ exactly $m$ times.

Proof Suppose that $w$ occurs $m$ times as a substring of $x$ and that the path associated with $x$ in
$AC(W)$ visits $T(w)$ exactly $l$ times. To prove the theorem it suffices to show that $m = l$. Indeed,
according to Lemma 4.3, if for some $1 \le i \le |x|$, $f(q, x[1..i]) = u \in T(w)$ then $x[1..i] = ...u = ...w$.
In particular, $m \ge l$. On the other hand, suppose that for some $1 \le i \le |x|$, $x[1..i] = ...w$. Let
$u = f(q, x[1..i])$. According to Lemma 4.3, $u$ is the longest string in $V$ such that $x[1..i] = ...u$.
Since $w \in V$ and $x[1..i] = ...w$, it follows that $|u| \ge |w|$. In particular, $u = ...w$ and therefore
$u \in T(w)$. This shows that $m \le l$ and hence $m = l$. This completes the proof of the theorem. □

This theorem means that the Aho-Corasick automaton can be used to count the number of 
occurrences of each of the keywords it searches for. In other words, we can use the Aho-Corasick 
automaton to construct an automaton that correctly matches any arbitrary set of strings W. This 
eliminates the need for the commonly used requirement in the analysis of random strings that 
compound patterns be reduced. A finite set of strings W is said to be reduced provided that no 
string in $W$ is a substring of another string in $W$. In this case, $T(w) = \{w\}$ for each $w \in W$ and
therefore occurrences of $w$ in a text are in one-to-one correspondence with the visits to state $w$ as
the automaton $AC(W)$ processes the text. However, in order for this last property to hold, it is
enough that $W$ is suffix-reduced, i.e., no string in $W$ is a suffix of another string in $W$. This follows
directly from Theorem 4.4 because for all $w \in W$, $T(w) = \{w\}$ precisely when $W$ is suffix-reduced.
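
The counting property can be checked on small examples. The sketch below is an assumption-laden illustration: it builds the automaton naively from the longest-prefix suffix rule and credits each visited state to every keyword that is a suffix of that state (the usual Aho-Corasick output function), so that it also handles keyword sets that are not reduced. The names `build_ac` and `count_occurrences` are hypothetical.

```python
def build_ac(W, alphabet):
    """States and transitions of AC(W), built naively from the definition."""
    V = {w[:i] for w in W for i in range(len(w) + 1)}

    def longest_suffix_state(s):
        return next(s[k:] for k in range(len(s) + 1) if s[k:] in V)

    f = {(u, a): longest_suffix_state(u + a) for u in V for a in alphabet}
    return V, f


def count_occurrences(W, alphabet, text):
    """Count (possibly overlapping) occurrences of each keyword in text."""
    V, f = build_ac(W, alphabet)
    # Assumption of this sketch: a visit to u counts for every keyword ending u.
    hits = {u: [w for w in W if u.endswith(w)] for u in V}
    counts = {w: 0 for w in W}
    u = ""
    for a in text:
        u = f[(u, a)]
        for w in hits[u]:
            counts[w] += 1
    return counts


counts = count_occurrences({"ba", "abba"}, "ab", "abbabba")
```

In this example the set {ba, abba} is not reduced, since ba is a suffix of abba; each visit to the state abba therefore counts as an occurrence of both keywords.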

We finish this section with some remarks regarding the computational complexity of Aho-Corasick 
automata. This type of automaton can be implemented in time and space proportional to the sum 
of the lengths of all words in W. Furthermore, in the case of keyword sets with a single string, 
Aho-Corasick automata turn out to be minimal: for all $w \in A^*$, $AC(\{w\})$ is the automaton with
the smallest number of vertices that recognizes the language $A^*\{w\}$.

We note that many algorithms other than Aho-Corasick can search for a set of keywords. See 
[HU79], [CR02] for more information. See [LRD+05] for an account of minimization algorithms that
can be used to reduce the number of states of a given DFA.



5 Markov chain embedding 

In this section, we extend the mathematical notation and definitions introduced above to describe 
random walks on automata, a procedure referred to as the Markov chain embedding. This procedure 
is the key step required to move from deterministic to probabilistic pattern matching, which is 
essential for the determination of the statistical significance of genomic motif searches. This section 
gives a self-contained presentation of the key mathematical results and proofs. 

5.1 Mathematical results 

As before, A is used to denote a generic alphabet. We introduce the concept of a random text 
$X = (X_n)_{n \ge 1}$, a sequence of $A$-valued independent and identically distributed random variables.
The distribution of $X_1$ in $A$ is denoted as Prob(·); in particular, for all $n \ge 1$ and $a \in A$, Prob($a$)
corresponds to the probability that $X_n = a$. We also define

$$A^+ := A^* \setminus \{\epsilon\}.$$

In other words, $A^+$ is the set of all non-empty words formed by concatenating characters in $A$.

The following definition formalizes the notion of Markov chain embedding as used in the literature 
by most authors. 

Definition Let $G = (V, A, f, q, T)$ be a deterministic finite automaton. Define $V^G := f(q, A^+)$, i.e.,
$V^G$ is the set of all states $u \in V$ for which there exists $x \in A^+$ such that $f(q, x) = u$. The Markov
chain embedding of $X$ in $G$ is the sequence of $V^G$-valued random variables $X^G := (X^G_n)_{n \ge 1}$ where

$$X^G_n := f(q, X_1 \ldots X_n) \qquad (n \ge 1).$$

Recall that a sequence $Y = (Y_n)_{n \ge 1}$ of $V^G$-valued random variables is said to be a first-order
homogeneous Markov chain provided that for all $n \ge 1$ and $u_1, \ldots, u_n, v \in V^G$,

$$P(Y_{n+1} = v \mid Y_n = u_n, \ldots, Y_1 = u_1) = P(Y_{n+1} = v \mid Y_n = u_n),$$

and this last probability does not depend on n. The following theorem allows automatic computa- 
tion of many statistics associated with patterns in random strings by connecting the probabilistic 
calculations to the behavior of first-order homogeneous Markov chains defined on the state space of 
an appropriate automaton. 

Theorem 5.1 If $X = (X_n)_{n \ge 1}$ is a sequence of i.i.d. $A$-valued random variables and
$G = (V, A, f, q, T)$ is a deterministic finite automaton then $X^G$ is a first-order homogeneous Markov
chain with initial distribution

$$P(X^G_1 = u) = \sum_{a \in A : f(q, a) = u} \text{Prob}(a) \qquad (u \in V^G), \qquad (5)$$

and probability transitions

$$P(X^G_{n+1} = v \mid X^G_n = u) = \sum_{a \in A : f(u, a) = v} \text{Prob}(a) \qquad (u, v \in V^G). \qquad (6)$$


Proof The proof of (5) is direct. To show the Markov property observe that according to (1),
$X^G_{n+1} = f(X^G_n, X_{n+1})$. As a result, for all $u_1, \ldots, u_n, v \in V$ it applies that

$$\begin{aligned}
P(X^G_1 = u_1, \ldots, X^G_n = u_n, X^G_{n+1} = v) &= P(X^G_1 = u_1, \ldots, X^G_n = u_n, f(u_n, X_{n+1}) = v) \\
&= P(X^G_1 = u_1, \ldots, X^G_n = u_n) \cdot P(f(u_n, X_{n+1}) = v),
\end{aligned}$$

where for the second identity we have used that $X_{n+1}$ is independent of $X_1, \ldots, X_n$. This shows
that $X^G$ is a first-order Markov chain. Furthermore, since the distribution of $X_{n+1}$ does not depend
on $n$, it follows that the conditional probability $P(X^G_{n+1} = v \mid X^G_n = u_n, \ldots, X^G_1 = u_1)$ depends
only on $u_n$ and $v$ but not $n$. This shows that $X^G$ is homogeneous. Therefore (6) follows almost
immediately. This completes the proof. □

This theorem describes a random walk on the vertices of the automaton, where the probability 
of a transition along an edge labeled with the character a is Prob(a). In other words, a transition 
that occurs in the deterministic automaton in response to reading character a occurs randomly with 
probability Prob(a). Therefore, the random walk can be represented by a first-order Markov chain, 
where the transition probability depends only on the current state and not on the preceding states.
A direct consequence of this theorem is the following simple way to construct the transition matrix 
of the Markov chain. To state the result we use Iverson's brackets: if p is a statement then [p] = 1 
provided that p is a true statement, otherwise [p] = 0. 



Corollary 5.2 If $G$ and $X$ are defined as in Theorem 5.1 then the probability transition matrix of $X^G$ in $V \times V$ is given by the formula

$$P_G = \sum_{\alpha \in \mathcal{A}} \mathrm{Prob}(\alpha) \cdot G_\alpha, \tag{7}$$

where $G_\alpha$ is the $V \times V$ matrix such that for all $u, v \in V$, $G_\alpha(u,v) = [f(u,\alpha) = v]$.

In the above result, $G_\alpha$ corresponds to the incidence matrix of $G$ where only edges labeled with the character $\alpha$ are considered. See the middle and bottom part of figure 1 for a representation of $G_\alpha$ with $G = AC(\{abba\})$ and $\alpha = a$ or $b$.
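As an illustration of Corollary 5.2, the following minimal sketch assembles $P_G$ for $G = AC(\{abba\})$ by adding $\mathrm{Prob}(\alpha)$ to the entry selected by each labeled edge. The helper names `ac_step` and `transition_matrix`, and the values $\mathrm{Prob}(a) = 1/3$, $\mathrm{Prob}(b) = 2/3$, are assumptions made for this example only; the transition rule used (move to the longest suffix of the text read so far that is a prefix of a keyword) is the standard Aho-Corasick rule.

```python
from fractions import Fraction

def ac_step(state, ch, keywords):
    """Standard Aho-Corasick move: the new state is the longest suffix of
    state + ch that is a prefix of some keyword (possibly the empty string)."""
    text = state + ch
    for start in range(len(text) + 1):
        suffix = text[start:]
        if any(k.startswith(suffix) for k in keywords):
            return suffix
    return ""  # never reached: the empty suffix is a prefix of every keyword

def transition_matrix(states, probs, step):
    """P_G = sum over characters c of Prob(c) * G_c, where G_c(u, v) = [step(u, c) == v]."""
    index = {s: i for i, s in enumerate(states)}
    matrix = [[Fraction(0)] * len(states) for _ in states]
    for u in states:
        for ch, pr in probs.items():
            matrix[index[u]][index[step(u, ch)]] += pr
    return matrix

keywords = ["abba"]
states = ["", "a", "ab", "abb", "abba"]              # prefixes of the keyword
probs = {"a": Fraction(1, 3), "b": Fraction(2, 3)}   # illustrative character probabilities
P_G = transition_matrix(states, probs, lambda u, c: ac_step(u, c, keywords))
```

Exact fractions are used so that every row of `P_G` can be checked to sum to one without rounding error.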



5.2 Prototype application of the Markov chain embedding 

Theorem 5.1 allows the calculation of the statistical significance of matches of a regular pattern in a random string. To understand this application of the theorem, consider a random text, i.e., a sequence $X = (X_n)_{n \ge 1}$ of i.i.d. random variables taking values in some alphabet set $\mathcal{A}$. The patterns to be matched are represented as a finite number of distinct regular languages $L_1, \ldots, L_m$ in $\mathcal{A}^*$. We then define matches to each language as

$$S_j^n := \text{number of substrings of } X_1 \cdots X_n \text{ that belong to } L_j.$$

For each language (different $j$) we construct an automaton $G_j$ that recognizes the regular language $\mathcal{A}^* L_j$ and let $T_j$ denote the set of terminal states of $G_j$. Define the synchronized automaton constructed from the $G_j$, $G := G_1 \times \cdots \times G_m$, and let $T$ denote the set of terminal states of $G$. According to Theorem 4.2 there are $m_j$ (possibly overlapping) substrings of $X_1 \cdots X_n$ that belong to $L_j$ provided that the Markov chain $(X_i^G)_{i=1,\ldots,n}$ visits the set of states $T(L_j)$ exactly $m_j$ times. Therefore, if we define

$$T_j^n := \text{number of times that } (X_i^G)_{i=1,\ldots,n} \text{ visits } T(L_j)$$

then it follows that the vector of substring counts (the $S_j^n$) is equal to the number of times the Markov chain visits the corresponding terminal states:

$$(S_1^n, \ldots, S_m^n) = (T_1^n, \ldots, T_m^n).$$

In particular, the distribution of $(S_1^n, \ldots, S_m^n)$ can be completely studied in terms of the distribution of $(T_1^n, \ldots, T_m^n)$, to which we can apply the theory of Markov chains.
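The identity between substring counts and visits to terminal states can be checked by brute force on short texts. The sketch below is only an illustration under simplifying assumptions: each language consists of a single keyword, each $G_j$ is the Aho-Corasick automaton of that keyword, and the product automaton is simulated by running the $G_j$ in lockstep. The function names are hypothetical.

```python
from itertools import product

def ac_step(state, ch, keywords):
    """Standard Aho-Corasick move: longest suffix of state + ch that is a prefix of a keyword."""
    text = state + ch
    for start in range(len(text) + 1):
        suffix = text[start:]
        if any(k.startswith(suffix) for k in keywords):
            return suffix
    return ""

def visit_counts(text, languages):
    """Run one automaton per language in lockstep (the product automaton) and
    count, for each language, the visits to its terminal state."""
    states = [""] * len(languages)
    visits = [0] * len(languages)
    for ch in text:
        for j, keywords in enumerate(languages):
            states[j] = ac_step(states[j], ch, keywords)
            visits[j] += states[j] in keywords   # terminal state = the keyword itself
    return visits

def substring_counts(text, languages):
    """Directly count the (possibly overlapping) occurrences of each single keyword."""
    return [sum(text.startswith(keywords[0], i) for i in range(len(text)))
            for keywords in languages]

languages = [["ba"], ["abba"]]                   # L_1 = {ba}, L_2 = {abba}
counts_agree = all(
    visit_counts(text, languages) == substring_counts(text, languages)
    for n in range(9)
    for text in map("".join, product("ab", repeat=n))
)
```

For example, the text `abbabba` contains two occurrences of each keyword, and both functions return `[2, 2]` for it.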

Several refinements of the above method are possible for different tasks. For instance, if we are interested in forbidden patterns, the probability that no substring of $X_1 \cdots X_n$ belongs to $\cup_{j=1}^m L_j$ corresponds to the probability that $(T_1^n, \ldots, T_m^n) = (0, \ldots, 0)$. In addition, the over- or underrepresentation of patterns described by the languages $L_1$ and $L_2$ given the vector of counts for languages $L_3, \ldots, L_m$ could be studied in terms of the joint distribution of $(T_1^n, \ldots, T_m^n)$ and the marginal distribution of $(T_3^n, \ldots, T_m^n)$. Finally, the aggregated number of occurrences of strings in $\cup_{j=1}^m L_j$ as substrings of $X_1 \cdots X_n$ corresponds to the total number of visits that $(X_i^G)_{i=1,\ldots,n}$ makes to $T$.
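The forbidden-pattern refinement amounts to propagating the distribution of the embedded chain while discarding every path that enters a terminal state. The following sketch does this for the single forbidden keyword ba; the transition table `DELTA`, the function name and the value $p = 1/3$ are assumptions chosen for illustration. For this keyword the answer can be verified by hand, because the only texts of length $n$ avoiding ba are $a^i b^{n-i}$.

```python
from fractions import Fraction

# Hypothetical transition table of an automaton recognizing A*ba over {a, b}.
DELTA = {("", "a"): "", ("", "b"): "b",
         ("b", "a"): "ba", ("b", "b"): "b",
         ("ba", "a"): "", ("ba", "b"): "b"}

def avoid_probability(delta, start, terminal, probs, n):
    """Probability that a random text of length n never drives the automaton
    into a terminal state (i.e. that the forbidden pattern never occurs)."""
    dist = {start: Fraction(1)}
    for _ in range(n):
        nxt = {}
        for state, mass in dist.items():
            for ch, pr in probs.items():
                target = delta[(state, ch)]
                if target not in terminal:       # drop paths that hit a forbidden state
                    nxt[target] = nxt.get(target, Fraction(0)) + mass * pr
        dist = nxt
    return sum(dist.values(), Fraction(0))

p = Fraction(1, 3)
q = 1 - p
no_ba_in_6 = avoid_probability(DELTA, "", {"ba"}, {"a": p, "b": q}, 6)
by_hand = sum(p ** i * q ** (6 - i) for i in range(7))   # texts a^i b^(6-i)
```

Both `no_ba_in_6` and `by_hand` evaluate the same probability, the first through the automaton and the second through direct enumeration.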

The particular form of the product automaton $G_1 \times \cdots \times G_m$ we have been using is sometimes not computationally efficient. Indeed, in many situations the product automaton has a computationally intractable number of states. The key mathematical property of this automaton is that its states are associated with the detection or non-detection of each of the languages $L_1, \ldots, L_m$. This allows one to determine the distribution of $(S_1^n, \ldots, S_m^n)$ in terms of the Markov chain $X^G$. However, this property is not exclusive to product automata. Other authors have proposed automata with similar characteristics, called marked automata, that can be used in the context of regular languages and random strings modeled by Markov sources [NSF02, Nic03]. For a related discussion see [Lla07], where a synchronization argument is used to construct the automaton with the smallest state space required for analyzing the number of matches with a regular pattern in a random string generated by a Markov source.



6 Application to a compound pattern 

This section considers a prototype example for studying the sooner-time and frequency statistics of a possibly non-reduced compound pattern in random strings produced by a memoryless source. In this context, any string in a compound pattern counts as a match. Potential applications of this apparatus include the study of RNA motifs, in which the compound pattern might include a degenerate base (e.g., the symbol R stands for either of the two purines, A and G, so the sequence CCRU represents the compound pattern {CCAU, CCGU}), or might be defined by base pairing (e.g., the sequence 1GAAA1', with A' := U, C' := G, G' := C or U and U' := A or G, allows the first and last nucleotide to pair with each other; the compound pattern is {AGAAAU, CGAAAG, GGAAAC, GGAAAU, UGAAAA, UGAAAG}).

We will use two patterns on a binary alphabet to illustrate the main principles. Consider the alphabet $\mathcal{A} = \{a, b\}$ and let $X = (X_n)_{n \ge 1}$ be a sequence of i.i.d. $\mathcal{A}$-valued random variables with distribution $P(X_1 = a) = p$ and $P(X_1 = b) = q$, with $p \cdot q > 0$ and $p + q = 1$. In this example we study the occurrences of the patterns ba and abba in $X$.

For the rest of this section, $G$ denotes the Aho-Corasick automaton $AC(\{ba, abba\})$. A visual representation of $G$ is given in figure 2.

According to Theorem 5.1, $X^G$ is a first-order homogeneous Markov chain with states a, b, ab, ba, abb, abba, which we label respectively as 1, 2, 3, 4, 5, 6. From (5), (6) and (7) it follows that $X^G$ has initial distribution and probability transition matrix

$$\mu := [\, p \;\; q \;\; 0 \;\; 0 \;\; 0 \;\; 0 \,], \qquad
P := \begin{bmatrix}
p & 0 & q & 0 & 0 & 0 \\
0 & q & 0 & p & 0 & 0 \\
0 & 0 & 0 & p & q & 0 \\
p & 0 & q & 0 & 0 & 0 \\
0 & q & 0 & 0 & 0 & p \\
p & 0 & q & 0 & 0 & 0
\end{bmatrix}.$$

Figure 2: The automaton that recognizes the non-reduced compound pattern {ba, abba} in a text constructed using the binary alphabet {a, b}. Top, the Aho-Corasick automaton AC({abba, ba}) which detects all occurrences of ba and abba in a binary text. The initial state is the empty string (left), and the terminal states are ba and abba (right). Bottom, representation of the first-order homogeneous Markov chain associated with a random text embedded in the automaton on top. The text is produced by a memoryless source where the character a occurs with probability p and the character b occurs with probability q. The Markov chain starts at state 1 with probability p and at state 2 with probability q. The probability that abba occurs in a random string of length n is equivalent to the probability that the Markov chain visits state 6 within (n - 1) steps. Similarly, the probability that ba occurs in a random string is equivalent to the probability that the Markov chain visits states 4 or 6.

A visual representation of $X^G$ is displayed in figure 2.

According to Lemma 4.3 each occurrence of abba in $X$ corresponds to a visit of $X^G$ to state 6. On the other hand, each occurrence of ba which does not contribute to an occurrence of abba corresponds to a visit to state 4. In particular, all occurrences of ba in $X$ correspond to visits of $X^G$ to states 4 and 6.



6.1 Sooner-time distribution of two non-reduced patterns 

Broadly speaking, the sooner-time of a pattern corresponds to the position of the first occurrence of the pattern in a random text. Potential applications of our apparatus include the analysis of the occurrence of any one of a set of completely different RNA patterns that can catalyze the same reaction, such as RNA self-cleavage [TB00].

Define

$$T := \text{sooner-time of ba or abba},$$

i.e., $T$ is the smallest $n$ such that $X_1 \cdots X_n = \ldots ba$ or $X_1 \cdots X_n = \ldots abba$. To study the distribution of $T$ consider the matrix and vectors

$$Q := \begin{bmatrix}
p & 0 & q & 0 \\
0 & q & 0 & 0 \\
0 & 0 & 0 & q \\
0 & q & 0 & 0
\end{bmatrix}, \qquad
\nu := [\, p \;\; q \;\; 0 \;\; 0 \,], \qquad
u := \begin{bmatrix} 0 \\ p \\ p \\ 0 \end{bmatrix}, \qquad
v := \begin{bmatrix} 0 \\ 0 \\ 0 \\ p \end{bmatrix}.$$

The matrix $Q$ is obtained by removing the fourth and sixth rows and columns of the probability transition matrix $P$, i.e., the entries associated with the patterns ba and abba. The vector $\nu$ corresponds to the vector $\mu$ with the fourth and sixth columns removed. The vectors $u$ and $v$ correspond to the fourth and sixth columns of $P$ with the fourth and sixth rows removed.

Each entry in any power of $Q$ is an aggregate probability: the probability that $X^G$ follows certain paths that avoid any edge that is incident to states 4 or 6. The entry in row $r$ and column $c$ of $Q^n$ corresponds to the probability that $X^G_{n+1} = c$ and $(X^G_i)_{i=1,\ldots,n+1}$ does not visit states 4 and 6, given that $X^G_1 = r$. Since the entry in row $r$ of $(u + v)$ corresponds to the probability that $X^G_{n+1} = 4$ or $6$ given that $X^G_n = r$, it follows that

$$\mathrm{Prob}[T = n] = \nu \cdot Q^{n-2} \cdot (u + v) \qquad (n \ge 2). \tag{8}$$



The above expression can be rewritten in terms of the generating function of $T$. Let $I_4$ be the $(4 \times 4)$ identity matrix. Since

$$\sum_{n=0}^{\infty} z^n \cdot Q^n = (I_4 - z \cdot Q)^{-1} \qquad (|z| < 1), \tag{9}$$

it follows from (8) that

$$\sum_{n=2}^{\infty} \mathrm{Prob}[T = n]\, z^n = z^2 \cdot \nu \cdot (I_4 - z \cdot Q)^{-1} \cdot (u + v).$$

Using the cofactor formula to invert the matrix on the right-hand side above, we obtain for this example that

$$\sum_{n=2}^{\infty} \mathrm{Prob}[T = n]\, z^n = \frac{pq z^2}{(1 - pz)(1 - qz)}. \tag{10}$$



Before continuing we introduce some standard notation [Wil94, FS06]. In what follows, wherever $f(z)$ is a power series in the variable $z$, i.e., $f(z) = \sum_{n=0}^{\infty} f_n z^n$ with $f_0, f_1, \ldots$ complex numbers, the coefficient of $z^n$ of $f(z)$ is denoted $[z^n] f(z)$. Specifically, $[z^n] f(z) := f_n$. For instance, using a geometric series argument, it follows that

$$\frac{1}{\alpha - \beta z} = \sum_{n=0}^{\infty} \frac{\beta^n}{\alpha^{n+1}}\, z^n \qquad (\alpha, \beta \ne 0;\ |z| < |\alpha/\beta|).$$

In particular, $[z^n]\, 1/(\alpha - \beta z) = \beta^n \alpha^{-(n+1)}$. Via successive differentiation of both sides above with respect to the variable $z$, one obtains for all integer $m \ge 1$ the following well-known formula [Wil94]

$$[z^n]\, \frac{1}{(\alpha - \beta z)^m} = \frac{(n+1) \cdots (n+m-1)}{(m-1)!}\, \beta^n \alpha^{-(n+m)} \qquad (\alpha, \beta \ne 0;\ n \ge 0). \tag{11}$$



To obtain an explicit formula for $\mathrm{Prob}[T = n]$ we use the partial fraction decomposition of the right-hand side of equation (10) and then (11) to extract the coefficient of $z^n$ in each of the terms of the decomposition [Wil94, FS06]. For instance, if $p \ne q$ then the partial fraction decomposition of the right-hand side in (10) leads to the identity

$$\sum_{n=2}^{\infty} \mathrm{Prob}[T = n]\, z^n = \frac{q}{(1 - pz)(p - q)} + \frac{p}{(1 - qz)(q - p)} + 1.$$

As a result, using (11) to identify the coefficient of $z^n$ in each of the terms on the right-hand side above, it follows that

$$\mathrm{Prob}[T = n] = \frac{q \cdot p^n - p \cdot q^n}{p - q} \qquad (n \ge 2;\ p \ne q).$$

On the other hand, if $p = q$ then

$$\sum_{n=2}^{\infty} \mathrm{Prob}[T = n]\, z^n = \frac{4}{(2 - z)^2} - \frac{4}{2 - z} + 1,$$

and therefore

$$\mathrm{Prob}[T = n] = \frac{n - 1}{2^n} \qquad (n \ge 2;\ p = q).$$
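The matrix formula (8) and the closed forms obtained from the partial fraction decomposition can be compared numerically. The sketch below is an illustration only: it hard-codes $Q$, $\nu$ and $u + v$ with the zero entries as we read them off the automaton $AC(\{ba, abba\})$ (states 1, 2, 3, 5 in that order), which is an assumption of this example, and the function names are hypothetical.

```python
from fractions import Fraction

def vec_mat(row, matrix):
    """Row vector times matrix."""
    return [sum(row[i] * matrix[i][j] for i in range(len(row)))
            for j in range(len(matrix[0]))]

def sooner_time_pmf(p, n_max):
    """Prob[T = n] = nu * Q^(n-2) * (u + v) for n = 2, ..., n_max."""
    q = 1 - p
    Q = [[p, 0, q, 0],
         [0, q, 0, 0],
         [0, 0, 0, q],
         [0, q, 0, 0]]
    nu = [p, q, 0, 0]
    u_plus_v = [0, p, p, p]          # one-step probability of entering state 4 or 6
    pmf, row = {}, nu
    for n in range(2, n_max + 1):
        pmf[n] = sum(r * c for r, c in zip(row, u_plus_v))
        row = vec_mat(row, Q)        # advance the non-absorbed mass by one step
    return pmf

def closed_form(p, n):
    """Closed forms from the partial fraction decomposition of (10)."""
    q = 1 - p
    if p == q:
        return Fraction(n - 1, 2 ** n)
    return (q * p ** n - p * q ** n) / (p - q)

pmf_third = sooner_time_pmf(Fraction(1, 3), 12)
pmf_half = sooner_time_pmf(Fraction(1, 2), 12)
```

Passing `Fraction` values for `p` keeps all probabilities exact, so the two computations can be compared with `==`.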

To study which of the patterns ba or abba is the first to be observed, notice that by definition of $T$, $X_1 \cdots X_T = \ldots ba$ or $X_1 \cdots X_T = \ldots abba$. To determine the probability that ba is observed before abba, or the probability that ba and abba are observed simultaneously for the first time, we use that

$$P(T = n,\ X_1 \cdots X_T \ne \ldots abba) = \nu \cdot Q^{n-2} \cdot u \qquad (n \ge 2),$$
$$P(T = n,\ X_1 \cdots X_T = \ldots abba) = \nu \cdot Q^{n-2} \cdot v \qquad (n \ge 2).$$

Since $\det(I_4 - Q) = (1 - p) \cdot (1 - q)$, it follows from (9) that

$$P(X_1 \cdots X_T \ne \ldots abba) = \nu \cdot (I_4 - Q)^{-1} \cdot u, \qquad P(X_1 \cdots X_T = \ldots abba) = \nu \cdot (I_4 - Q)^{-1} \cdot v.$$

Using symbolic algebra software to evaluate the right-hand side of the above identities we find that the probability that ba is observed before pattern abba is $(1 - p^2 q)$. The probability that ba and abba are observed simultaneously for the first time is therefore $p^2 q$.
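The two probabilities above can also be obtained without symbolic software, by solving the linear systems $(I_4 - Q) w = u$ and $(I_4 - Q) w = v$ exactly. The sketch below is a minimal illustration with a hand-written Gauss-Jordan solver over fractions; the helper names, the explicit zero entries of $Q$, $\nu$, $u$, $v$, and the test value $p = 1/3$ are assumptions of this example.

```python
from fractions import Fraction

def solve(matrix, rhs):
    """Solve matrix * w = rhs by Gauss-Jordan elimination with exact fractions."""
    n = len(rhs)
    aug = [[Fraction(x) for x in row] + [Fraction(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def first_pattern_probabilities(p):
    """Return (P(ba strictly before abba), P(ba and abba first observed together))."""
    q = 1 - p
    Q = [[p, 0, q, 0], [0, q, 0, 0], [0, 0, 0, q], [0, q, 0, 0]]
    nu, u, v = [p, q, 0, 0], [0, p, p, 0], [0, 0, 0, p]
    i_minus_q = [[(1 if i == j else 0) - Q[i][j] for j in range(4)] for i in range(4)]
    dot = lambda x, y: sum(a * b for a, b in zip(x, y))
    return dot(nu, solve(i_minus_q, u)), dot(nu, solve(i_minus_q, v))

ba_first, simultaneous = first_pattern_probabilities(Fraction(1, 3))
```

With $p = 1/3$ and $q = 2/3$ the second value is $p^2 q = 2/27$ and the two values add up to one, in agreement with the formulas in the text.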



Table 1: Joint distribution for the frequency statistics $S_1^6$ and $S_2^6$ as defined in (12) and (13), respectively. Since the probabilities in the third column add up to one, no other combination of $m_1$ and $m_2$ is possible for a random binary string of length 6. The probabilities in the third column can be computed via matrix multiplication using identity (14), or by extracting the coefficient of $x^6 y_1^{m_1} y_2^{m_2}$ of the generating function $F(x, y_1, y_2)$ in (15) or (16).

  $m_1$   $m_2$   Probability that $(S_1^6, S_2^6) = (m_1, m_2)$
  0       0       $p^6 + p^5 q + p^4 q^2 + p^3 q^3 + p^2 q^4 + p q^5 + q^6$
  1       0       $5p^5 q + 5p^4 q^2 + 7p^3 q^3 + 7p^2 q^4 + 5p q^5$
  2       0       $6p^4 q^2 + 5p^3 q^3 + 4p^2 q^4$
  3       0       $p^3 q^3$
  1       1       $3p^4 q^2 + 2p^3 q^3 + p^2 q^4$
  2       1       $4p^3 q^3 + 2p^2 q^4$



Table 2: The joint distribution of $S_1^6$ and $S_2^6$ displayed in Table 1 permits the calculation of the conditional distribution of $S_1^6$ given $S_2^6 = 0$. In particular, the expected value and variance of the number of occurrences of ba as a substring of $X_1 \cdots X_6$ can be reassessed when the pattern abba is known to not occur as a substring of the random string.

  $m_1$   Probability that $S_1^6 = m_1$ given that $S_2^6 = 0$
  1       $(5p - 18p^2 + 29p^3 - 24p^4 + 13p^5 - 5p^6) \,/\, (1 - 3p^2 + 6p^3 - 3p^4)$
  2       $(5p^6 - 13p^5 + 15p^4 - 11p^3 + 4p^2) \,/\, (1 - 3p^2 + 6p^3 - 3p^4)$
  3       $(p^3 - 3p^4 + 3p^5 - p^6) \,/\, (1 - 3p^2 + 6p^3 - 3p^4)$



6.2 Frequency statistics of two non-reduced patterns 

In this section we study the joint distribution of the number of occurrences of the patterns ba and abba in $X_1 \cdots X_n$. Observe that {ba, abba} is not a reduced set of patterns because ba is a suffix of abba. One might be interested in non-reduced patterns, for example, in studying the combinatorial function of miRNA seed sequences, which may or may not act together to regulate gene expression during translation [LBB05]. In order to test whether the occurrence of two seeds is correlated, we would need to first calculate the null distribution if there were no functional relationship: even in the absence of biological effects, the probability of observing one seed might affect the probability of observing the other (for example, if one seed were to overlap the other). For this example, again demonstrated on the two-letter alphabet, consider the random variables

$$S_1^n := \text{number of times that ba occurs as a substring of } X_1 \cdots X_n, \tag{12}$$
$$S_2^n := \text{number of times that abba occurs as a substring of } X_1 \cdots X_n. \tag{13}$$

The argument to be presented here could also be used to study the distribution of $(S_1^n - S_2^n, S_2^n)$, where $(S_1^n - S_2^n)$ corresponds to the number of times that ba appears as a substring of $X_1 \cdots X_n$ but without contributing to an occurrence of abba as a substring.

The notation introduced in section 6.1 will be extended to consider power series in several variables. For instance, if $g(x,y) = \sum_{n,m=0}^{\infty} g_{n,m}\, x^n y^m$, with $(g_{n,m})_{n,m \ge 0}$ an array of complex numbers, we define $[x^n y^m]\, g(x,y) := g_{n,m}$.

To study the joint distribution of $S_1^n$ and $S_2^n$ we use a transfer matrix method [FS06, GJ04]. Consider the matrix with polynomial entries and the vector

$$P_{y_1,y_2} := \begin{bmatrix}
p & 0 & q & 0 & 0 & 0 \\
0 & q & 0 & p y_1 & 0 & 0 \\
0 & 0 & 0 & p y_1 & q & 0 \\
p & 0 & q & 0 & 0 & 0 \\
0 & q & 0 & 0 & 0 & p y_1 y_2 \\
p & 0 & q & 0 & 0 & 0
\end{bmatrix}, \qquad
\delta := \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}.$$

The matrix $P_{y_1,y_2}$ is obtained by multiplying the fourth and sixth column of $P$ by $y_1$, and the sixth column of $P$ by $y_2$. Observe that if $y_1 = 1$ and $y_2 = 1$ then the entry in row $r$ and column $c$ of $P^n_{y_1,y_2}$ corresponds to the probability that $X^G_{n+1} = c$ given that $X^G_1 = r$. This is because the entries in $P^n$ correspond to the aggregate probability of all possible paths of length $n$ that start at state $r$ and end at state $c$. The entry in row $r$ and column $c$ of $P^n_{y_1,y_2}$ is a polynomial in the variables $y_1$ and $y_2$. The coefficient of $y_1^{m_1} y_2^{m_2}$ is the aggregate probability of all paths of length $n$ that start at $r$, end at $c$, and visit $m_1$ times the set of states {4, 6} and $m_2$ times the set {6}. As a result, for $n \ge 1$ and $m_1, m_2 \ge 0$ one finds that

$$\mathrm{Prob}[S_1^n = m_1, S_2^n = m_2] = [y_1^{m_1} y_2^{m_2}]\, (\mu \cdot P^{n-1}_{y_1,y_2} \cdot \delta). \tag{14}$$

This suffices to determine the joint distribution of $S_1^n$ and $S_2^n$ for small values of $n$. See tables 1 and 2 for specific computations in the case of $n = 6$.
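Identity (14) can be evaluated without symbolic software by storing each polynomial in $(y_1, y_2)$ as a dictionary from exponent pairs $(m_1, m_2)$ to probabilities. The sketch below is one possible illustration of this idea; the function name is hypothetical, and the transition list is our reading of the chain on $AC(\{ba, abba\})$ with states 1 to 6 = a, b, ab, ba, abb, abba (an assumption of this example).

```python
from collections import defaultdict
from fractions import Fraction

def joint_distribution(p, n):
    """Distribution of (S_1^n, S_2^n) = (#ba, #abba) in a random text of length n.
    Each state carries a polynomial in (y1, y2), stored as {(m1, m2): probability}."""
    q = 1 - p
    moves = {1: [(1, p), (3, q)], 2: [(2, q), (4, p)], 3: [(4, p), (5, q)],
             4: [(1, p), (3, q)], 5: [(2, q), (6, p)], 6: [(1, p), (3, q)]}
    current = {1: {(0, 0): p}, 2: {(0, 0): q}}          # initial distribution mu
    for _ in range(n - 1):                               # multiply by P_{y1,y2}, n - 1 times
        nxt = defaultdict(lambda: defaultdict(Fraction))
        for state, poly in current.items():
            for target, pr in moves[state]:
                d1 = 1 if target in (4, 6) else 0        # factor y1 on columns 4 and 6
                d2 = 1 if target == 6 else 0             # factor y2 on column 6
                for (m1, m2), mass in poly.items():
                    nxt[target][(m1 + d1, m2 + d2)] += mass * pr
        current = nxt
    total = defaultdict(Fraction)
    for poly in current.values():                        # multiply by the all-ones vector
        for key, mass in poly.items():
            total[key] += mass
    return dict(total)

table_half = joint_distribution(Fraction(1, 2), 6)
```

For $p = q = 1/2$ and $n = 6$ the six probabilities are 7/64, 29/64, 15/64, 1/64, 6/64 and 6/64 for $(m_1, m_2)$ = (0,0), (1,0), (2,0), (3,0), (1,1) and (2,1), which is what direct enumeration of the 64 binary strings gives.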
Define

$$F(x, y_1, y_2) := \sum_{n=1}^{\infty} \sum_{m_1=0}^{\infty} \sum_{m_2=0}^{m_1} \mathrm{Prob}[S_1^n = m_1, S_2^n = m_2]\; x^n y_1^{m_1} y_2^{m_2}.$$

In terms of generating functions, identity (14) implies that

$$F(x, y_1, y_2) = x \cdot \mu \cdot (I_6 - x \cdot P_{y_1,y_2})^{-1} \cdot \delta, \tag{15}$$

where $I_6$ is the $(6 \times 6)$ identity matrix. The matrix on the right-hand side above can be determined in closed form using symbolic algebra software. By doing so one derives that

$$F(x, y_1, y_2) = \frac{p q^3 y_1 (1 - y_2) x^4 + p q (y_1 - 1) x^2 + x}{p q^3 y_1 (y_2 - 1) x^4 + p q^2 y_1 (1 - y_2) x^3 + p q (1 - y_1) x^2 - x + 1}. \tag{16}$$



According to the definition of $F(x, y_1, y_2)$, the coefficient of $x^n y_1^{m_1} y_2^{m_2}$ on the right-hand side above corresponds to the probability that $(S_1^n, S_2^n) = (m_1, m_2)$. For small values of $n$ this allows a direct calculation of the joint distribution of $S_1^n$ and $S_2^n$ by determining the Taylor coefficients of $F(x, y_1, y_2)$ about $(x, y_1, y_2) = (0, 0, 0)$.

For large values of $n$ an asymptotic analysis of the joint distribution of $S_1^n$ and $S_2^n$ is more appropriate. This follows in the general context of linear (also called additive) functionals of Markov chains that we briefly describe next. For this consider an integer $d \ge 1$. In what follows, $d$-dimensional vectors are thought of as column vectors. For a $d$-dimensional vector $c$, we write $c'$ to refer to the transpose of $c$. Consider a vector-valued transformation $f = (f_1, \ldots, f_d)'$, where each entry $f_j : V^G \to \mathbb{R}$ is a given function. We are interested in the asymptotic behavior of the random variables

$$S_n^f := \sum_{i=1}^{n} f(X_i^G).$$

Define the $d$-dimensional vector and $(d \times d)$ matrix

$$\mu := \lim_{n \to \infty} \frac{1}{n} \begin{bmatrix} E(S_n^{f_1}) \\ \vdots \\ E(S_n^{f_d}) \end{bmatrix}, \tag{17}$$

$$\Sigma := \lim_{n \to \infty} \frac{1}{n} \begin{bmatrix}
\mathrm{Var}(S_n^{f_1}) & \cdots & \mathrm{Cov}(S_n^{f_1}, S_n^{f_d}) \\
\vdots & \ddots & \vdots \\
\mathrm{Cov}(S_n^{f_d}, S_n^{f_1}) & \cdots & \mathrm{Var}(S_n^{f_d})
\end{bmatrix}. \tag{18}$$



Whenever $X^G$ is an aperiodic and irreducible first-order homogeneous Markov chain in a finite state space, the entries in $\mu$ and $\Sigma$ above are finite and do not depend on the initial distribution of $X^G$ [Che99, Jon04]. In particular, $\Sigma$ is a positive semidefinite matrix, i.e., $c' \cdot \Sigma \cdot c \ge 0$ for all $d$-dimensional vectors $c$. Furthermore, the aperiodicity and irreducibility of $X^G$ imply that $(S_n^f - n\mu)/\sqrt{n}$ converges in distribution to a centered $d$-dimensional normal distribution with variance-covariance matrix $\Sigma$. (This follows from the Cramér-Wold device [Sha03] and the general results in [Che99, Jon04].) This means that for each pair of real numbers $a < b$ and $d$-dimensional vector $c$ such that $c' \cdot \Sigma \cdot c > 0$,

$$\lim_{n \to \infty} P\left( a \le c' \cdot \frac{S_n^f - n\mu}{\sqrt{n}} \le b \right) = \frac{1}{\sqrt{2\pi (c' \cdot \Sigma \cdot c)}} \int_a^b \exp\left( -\frac{x^2}{2 (c' \cdot \Sigma \cdot c)} \right) dx.$$

In addition, if $\det \Sigma > 0$ then

$$\lim_{n \to \infty} P\left( \frac{S_n^f - n\mu}{\sqrt{n}} \in B \right) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \int_B \exp\left( -\frac{x' \cdot \Sigma^{-1} \cdot x}{2} \right) dx,$$

for all measurable sets $B \subset \mathbb{R}^d$ whose boundary $\partial B$ is of Lebesgue measure zero.

In the context of the frequency statistics $(S_1^n, S_2^n)$ consider the function $f = (f_1, f_2)$, with $f_1(x) = [x \in \{4, 6\}]$ and $f_2(x) = [x = 6]$. Since $S_1^n = S_n^{f_1}$ and $S_2^n = S_n^{f_2}$, a central limit theorem for the 2-dimensional vector $(S_1^n, S_2^n)'$ is feasible provided that the quantities in (17) and (18) are computable and the $(2 \times 2)$ matrix $\Sigma$ is positive definite. For this we differentiate the generating function $F(x, y_1, y_2)$ to obtain

$$\sum_{n=1}^{\infty} E(S_1^n)\, x^n = \frac{\partial F}{\partial y_1}(x, 1, 1) = \frac{p q x^2}{(1 - x)^2}, \tag{19}$$

$$\sum_{n=1}^{\infty} E(S_2^n)\, x^n = \frac{\partial F}{\partial y_2}(x, 1, 1) = \frac{p^2 q^2 x^4}{(1 - x)^2}, \tag{20}$$

$$\sum_{n=1}^{\infty} E(S_1^n (S_1^n - 1))\, x^n = \frac{\partial^2 F}{\partial y_1^2}(x, 1, 1) = \frac{2 p^2 q^2 x^4}{(1 - x)^3}, \tag{21}$$

$$\sum_{n=1}^{\infty} E(S_2^n (S_2^n - 1))\, x^n = \frac{\partial^2 F}{\partial y_2^2}(x, 1, 1) = \frac{2 p^3 q^4 (1 - qx) x^7}{(1 - x)^3}, \tag{22}$$

$$\sum_{n=1}^{\infty} E(S_1^n \cdot S_2^n)\, x^n = \frac{\partial^2 F}{\partial y_1 \partial y_2}(x, 1, 1) = \frac{p^2 q^2 (1 - q(q - p) x^2 - p x) x^4}{(1 - x)^3}. \tag{23}$$



The coefficients of each of these generating functions can be easily extracted using (11). Furthermore, since for all random variables $X$ and $Y$ with finite second moment it applies that $\mathrm{Var}(X) = E(X \cdot (X - 1)) - E(X) \cdot (E(X) - 1)$ and that $\mathrm{Cov}(X, Y) = E(X \cdot Y) - E(X) \cdot E(Y)$, one can deduce from (11) and (19)-(23) the following asymptotic formulae as $n \to \infty$:

$$E(S_1^n) \sim n \cdot pq, \tag{24}$$
$$E(S_2^n) \sim n \cdot p^2 q^2, \tag{25}$$
$$\mathrm{Var}(S_1^n) \sim n \cdot pq\,(1 - 3pq), \tag{26}$$
$$\mathrm{Var}(S_2^n) \sim n \cdot p^2 q^2\,(1 + 2pq^2 - 7p^2 q^2), \tag{27}$$
$$\mathrm{Cov}(S_1^n, S_2^n) \sim n \cdot p^2 q^2\,(1 + q - 5pq). \tag{28}$$



These identities make explicit the terms in (17) and (18). Furthermore, using symbolic algebra software and replacing $q = 1 - p$ one can determine that

$$\det \Sigma = p^3 (1 - p)^3 (1 - 5p + 14p^2 - 25p^3 + 28p^4 - 16p^5 + 4p^6),$$

which is strictly positive for $0 < p < 1$. Therefore $((S_1^n, S_2^n)' - n\mu)/\sqrt{n}$ converges in distribution to a 2-dimensional centered normal random vector with variance-covariance matrix $\Sigma$, where $\mu$ and $\Sigma$ can be determined from (24)-(28) as defined in (17) and (18). For instance, if $p = q = 1/2$ then

$$\mu = \frac{1}{16} \begin{bmatrix} 4 \\ 1 \end{bmatrix}, \qquad \Sigma = \frac{1}{256} \begin{bmatrix} 16 & 4 \\ 4 & 13 \end{bmatrix}.$$
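The values of $\mu$ and $\Sigma$ for $p = q = 1/2$ can be cross-checked by brute force. In this example the exact means, variances and covariance of $(S_1^n, S_2^n)$ grow linearly in $n$ once $n \ge 7$, so their increments from $n = 10$ to $n = 11$ coincide with the limits in (17) and (18). The sketch below, with hypothetical function names, enumerates all equally likely binary texts of each length.

```python
from fractions import Fraction
from itertools import product

def count(text, word):
    """Number of (possibly overlapping) occurrences of word in text."""
    return sum(text.startswith(word, i) for i in range(len(text)))

def moments(n):
    """Exact means and (co)variances of (S_1^n, S_2^n) = (#ba, #abba) for p = q = 1/2."""
    weight = Fraction(1, 2 ** n)
    e1 = e2 = e11 = e22 = e12 = Fraction(0)
    for chars in product("ab", repeat=n):
        text = "".join(chars)
        s1, s2 = count(text, "ba"), count(text, "abba")
        e1 += weight * s1
        e2 += weight * s2
        e11 += weight * s1 * s1
        e22 += weight * s2 * s2
        e12 += weight * s1 * s2
    mean = (e1, e2)
    cov = (e11 - e1 * e1, e12 - e1 * e2, e22 - e2 * e2)   # Var(S1), Cov(S1, S2), Var(S2)
    return mean, cov

mean10, cov10 = moments(10)
mean11, cov11 = moments(11)
mu_estimate = tuple(b - a for a, b in zip(mean10, mean11))
sigma_estimate = tuple(b - a for a, b in zip(cov10, cov11))
```

The increments come out as $(4/16, 1/16)$ for the means and $(16/256, 4/256, 13/256)$ for the variance of $S_1^n$, the covariance, and the variance of $S_2^n$, respectively.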



7 Application to correlated modular patterns 

Correlated, modular patterns are important in the analysis of RNA motifs. Many functional molecules can be represented by sequence motifs made up of modules separated by relatively unconstrained spacer sequences [BFP+99]. This modularity implies that there are many more chances to match the pattern within a longer sequence than would be possible for a simple pattern or moderate-size compound pattern [SUB97, KY03, KDSM+05]. This fact can greatly alter estimates of the statistical significance of matching such a pattern. The correlations between modules primarily take the form of base pairs, which are essential for bringing the parts of the active site into the structural juxtaposition required for function.

More generally, in many kinds of biological sequence analysis one is interested in patterns that include correlations or gaps. We use numbers to denote correlations. For example, in the case of the binary alphabet {a, b},

1a2a2b1 = {aaaaaba, aababba, baaaabb, bababbb},

where either a or b could appear in the positions marked 1 and 2.

A gap of length exactly k is denoted as #_k whereas a gap of length at least k is denoted #_k.... The symbol # is used as a shorthand for #_1; in particular, #_k = #···# (k times). If the symbols #_k or #_k... appear more than once in the same pattern each appearance is independent. For instance

1a#_2b1 = {aaaaba, aaabba, aababa, aabbba, baaabb, baabbb, bababb, babbbb}.

Figure 3: The automaton that recognizes the occurrence of the modular pattern aa#...ba in a text constructed using the binary alphabet {a, b}. Top, automaton NC(aa#...ba) which counts non-overlapping occurrences of aa#...ba. The initial state is ε_1 (left), and the terminal state is ba (right). The symbol # corresponds to either of the two alphabet characters. Bottom, representation of the first-order homogeneous Markov chain associated with a random text embedded in the automaton on top. The text is produced by a memoryless source where the character a occurs with probability p and the character b occurs with probability q. The Markov chain starts at state 1 with probability q and at state 2 with probability p. The probability that there are m non-overlapping occurrences of aa#...ba in a random text of length n is equivalent to the probability that the Markov chain visits state 6 a total of m times in the first (n - 1) steps.

Finally, ab#...baa#_2...bb is the set of all strings of the form abxbaaybb where $x, y \in \{a, b\}^*$ are such that $|x| \ge 1$ and $|y| \ge 2$. This pattern consists of an infinite number of strings. In this case we refer to ab, baa and bb as the modules of the pattern.
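For fixed-length patterns, the notation above can be expanded mechanically into the set of strings it denotes. The sketch below is only an illustration with a hypothetical function name `expand`: letters stand for themselves, `#` stands for any character, and equal digits must be replaced by equal characters (as in the binary examples above, not by base-pairing complements). Gaps of fixed length are written out, so #_2 becomes `##`; gaps of unbounded length are not handled.

```python
from itertools import product

def expand(pattern, alphabet="ab"):
    """All strings matched by a fixed-length pattern over the given alphabet."""
    digits = sorted({c for c in pattern if c.isdigit()})
    gaps = [i for i, c in enumerate(pattern) if c == "#"]
    result = set()
    for values in product(alphabet, repeat=len(digits)):
        binding = dict(zip(digits, values))           # one character per correlation number
        for fill in product(alphabet, repeat=len(gaps)):
            chars = [binding.get(c, c) for c in pattern]
            for pos, ch in zip(gaps, fill):           # each gap position is filled independently
                chars[pos] = ch
            result.add("".join(chars))
    return result

correlated = expand("1a2a2b1")
with_gap = expand("1a##b1")
```

Here `correlated` has four elements and `with_gap` has eight, matching the two sets listed in the text.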

As in the previous section, for this example we consider the binary alphabet $\mathcal{A} = \{a, b\}$ and let $X = (X_n)_{n \ge 1}$ be a sequence of i.i.d. $\mathcal{A}$-valued random variables with distribution $P(X_1 = a) = p$ and $P(X_1 = b) = q$, with $p \cdot q > 0$ and $p + q = 1$. We study the number of non-overlapping occurrences of the pattern aa#...ba in $X_1 \cdots X_n$, and the sooner-time of the pattern 1a#...b1 in $X$.

7.1 Frequency statistics of a modular pattern 

In addition to modular patterns that contain correlations through base pairing, modular patterns that do not (as far as is currently known) require base pairing are also important for processes involving RNA. For example, transcriptional regulation requires combinatorial regulation of binding sites for transcription factors that activate and repress genes; splicing regulation requires specific combinations of splicing enhancers and repression; and microRNA targeting appears to be combinatorial [SRK06]. Many existing software packages for detecting overrepresented words, such as the MobyDick package [BLS00], identify words that are surprisingly common given the partition function by which they could be comprised of shorter words, but fail to take into account correlations between word abundances that could be caused by partial overlap of words of the same length.

To detect non-overlapping occurrences of the pattern aa#...ba in a general text we first seek an automaton that detects the language $\mathcal{A}^* aa \mathcal{A} \mathcal{A}^* ba$. This can be accomplished by concatenating the Aho-Corasick automata AC({aa}) and AC({ba}): we connect the terminal state of AC({aa}) to the initial state of AC({ba}) with two edges, one labeled with the character a, and the other with the character b (which we represent visually as a single edge labeled with the character #). The resulting automaton is denoted as AC(aa#...ba). By definition, the initial and terminal states of AC(aa#...ba) are the initial state of AC({aa}) and the terminal state of AC({ba}), respectively. The fact that AC(aa#...ba) recognizes the language $\mathcal{A}^* aa \mathcal{A} \mathcal{A}^* ba$ follows from Theorem 4.4. Let ε_1 and ε_2 denote the initial states of AC({aa}) and AC({ba}) respectively.

To detect each non-overlapping occurrence of aa#...ba in the random string $X_1 \cdots X_n$ we convey into the terminal state of AC(aa#...ba) the transitions of its initial state as follows: first remove all edges coming out from ba, add an edge labeled with the character a from ba to a, and also add an edge labeled with the character b from ba to ε_1. We refer to this automaton as NC(aa#...ba), or NC in short. Here N stands for non-overlapping and C for counting. See figure 3 for a visual representation of this automaton.

The fact that NC(aa#...ba) detects each non-overlapping occurrence of the pattern aa#...ba in a general text follows from the correctness of AC({aa}) and AC({ba}) and the way these two automata were concatenated.

According to Theorem 5.1, $X^{NC}$ is a first-order homogeneous Markov chain with states ε_1, a, aa, ε_2, b and ba, which we label respectively as 1, 2, 3, 4, 5 and 6. The initial distribution and probability transition matrix of $X^{NC}$ are

$$\mu := [\, q \;\; p \;\; 0 \;\; 0 \;\; 0 \;\; 0 \,], \qquad
P := \begin{bmatrix}
q & p & 0 & 0 & 0 & 0 \\
q & 0 & p & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & p & q & 0 \\
0 & 0 & 0 & 0 & q & p \\
q & p & 0 & 0 & 0 & 0
\end{bmatrix}.$$

A visual representation of this Markov chain is displayed in figure 3.
For $n \ge 1$, we define

$$S_n := \text{number of non-overlapping occurrences of aa\#...ba as a substring of } X_1 \cdots X_n.$$

The random variable $S_n$ corresponds to the number of visits that $(X_i^{NC})_{i=1,\ldots,n}$ makes to state 6. The transfer matrix method used in section 6.2 can now be used to characterize the distribution of $S_n$. We therefore mark the edges that are incident to state 6 with a dummy variable $y$ that keeps track of the number of times that this state is visited. This is equivalent to considering the matrix and vector

$$P_y := \begin{bmatrix}
q & p & 0 & 0 & 0 & 0 \\
q & 0 & p & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & p & q & 0 \\
0 & 0 & 0 & 0 & q & p y \\
q & p & 0 & 0 & 0 & 0
\end{bmatrix}, \qquad
\delta := \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}.$$

As in (14), it follows that

$$\mathrm{Prob}(S_n = m) = [y^m]\, (\mu \cdot P_y^{n-1} \cdot \delta) \qquad (n \ge 1;\ m \ge 0).$$

This result suffices to determine the exact distribution of $S_n$ for small values of $n$. Furthermore, as in (15), the generating function associated with $S_n$ is

$$F(x, y) := \sum_{n=1}^{\infty} \sum_{m=0}^{\infty} P(S_n = m)\, x^n y^m = x \cdot \mu \cdot (I_6 - x \cdot P_y)^{-1} \cdot \delta = \frac{x\,(p^3 q y x^4 + p^2 q x^3 - q x + 1)}{1 - (p + 2q) x + q x^2 + p^2 q x^3 - p^2 q^2 x^4 - p^3 q y x^5},$$

where the last identity was determined by using symbolic algebra software to invert the matrix $(I_6 - x \cdot P_y)$.
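For small $n$ the exact distribution of $S_n$ can be computed by propagating pairs (state, number of visits to state 6) through the chain, and cross-checked against a direct scan of every text. In the sketch below, which is an illustration only, the transition list is our reading of $P_y$, and the direct scan uses the regular expression `aa.+?ba` (leftmost, shortest match, resuming after each match) as a stand-in for the non-overlapping counting performed by NC(aa#...ba); both choices and the function names are assumptions of this example.

```python
import re
from fractions import Fraction
from itertools import product

def nc_distribution(p, n):
    """Distribution of S_n via the embedded chain; states 1..6 = e1, a, aa, e2, b, ba."""
    q = 1 - p
    moves = {1: [(1, q), (2, p)], 2: [(1, q), (3, p)], 3: [(4, Fraction(1))],
             4: [(4, p), (5, q)], 5: [(5, q), (6, p)], 6: [(1, q), (2, p)]}
    current = {(1, 0): q, (2, 0): p}          # (state, matches so far) -> probability
    for _ in range(n - 1):
        nxt = {}
        for (state, m), mass in current.items():
            for target, pr in moves[state]:
                key = (target, m + (target == 6))   # entering state 6 marks one match
                nxt[key] = nxt.get(key, 0) + mass * pr
        current = nxt
    dist = {}
    for (_, m), mass in current.items():
        dist[m] = dist.get(m, 0) + mass
    return dist

def brute_force_distribution(p, n):
    """Same distribution, by scanning every binary text of length n."""
    q = 1 - p
    dist = {}
    for chars in product("ab", repeat=n):
        text = "".join(chars)
        m = len(re.findall(r"aa.+?ba", text))
        weight = p ** text.count("a") * q ** text.count("b")
        dist[m] = dist.get(m, 0) + weight
    return dist

p = Fraction(1, 3)
chain_result = nc_distribution(p, 8)
scan_result = brute_force_distribution(p, 8)
```

For $n = 5$ the only matching texts are aaaba and aabba, so the probability of one occurrence is $p^3 q$, which both functions reproduce.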



As in section 6.2, asymptotic formulae for $E(S_n)$ and $\mathrm{Var}(S_n)$ can be obtained using the partial fraction decomposition of $\frac{\partial F}{\partial y}(x, 1)$ and $\frac{\partial^2 F}{\partial y^2}(x, 1)$. Since $X^{NC}$ is irreducible and aperiodic,

$$\frac{S_n - E(S_n)}{\sqrt{\mathrm{Var}(S_n)}}$$

can be shown to converge in distribution to a standard normal distribution.

Figure 4: The automaton that recognizes the correlated modular pattern 1a#...b1 in a text constructed using the binary alphabet {a, b}, where the symbol 1 can be either a or b but must be the same character in both places. The grid in the middle is a visual representation of the synchronized automaton ST(1a#...b1), where the horizontal axis is the automaton ST(aa#...ba) and the vertical axis is the automaton ST(ba#...bb). In the probabilistic automaton, blue edges are followed with probability p, red edges are followed with probability q, and black edges are followed with probability 1. The terminal states 13, 15, 19, 22, and 24 (top row) correspond to occurrence of the pattern ba#...bb but not aa#...ba, while terminal states 16, 20 and 23 (right column) correspond to occurrence of the pattern aa#...ba but not ba#...bb. The terminal state 25 (top right corner) corresponds to recognition of both patterns.

7.2 Sooner-time of a correlated modular pattern 

Many functional RNAs must occur in a specific sequence context in order to function. For example, riboswitches (RNA molecules that regulate certain genes) must appear immediately upstream from the start of the coding sequence [WNB02]. A related example is IRE, the iron-responsive element in the ferritin mRNA, which binds a protein cofactor to enhance the transcription of genes involved in iron metabolism [HCR+87]. In these cases, we are interested in the distribution of first occurrences of a modular RNA pattern, including correlations, relative to a specified start site.
To illustrate calculations of this type, define

$$T := \text{sooner-time of 1a\#...b1 in } X.$$

In other words, $T$ is the smallest $n$ such that $X_1 \cdots X_n = \ldots$aa#...ba or $X_1 \cdots X_n = \ldots$ba#...bb. To study the distribution of $T$, we synchronize any automata that recognize the languages $\mathcal{A}^* aa \mathcal{A} \mathcal{A}^* ba$ and $\mathcal{A}^* ba \mathcal{A} \mathcal{A}^* bb$. Consider the automaton AC(aa#...ba) as defined in section 7.1. Similarly define AC(ba#...bb). Since we are only interested in the first occurrence of either of the patterns aa#...ba or ba#...bb, we turn the terminal states of these automata into absorbing states. This is accomplished by resetting all the edges coming out from terminal states to point to themselves. We refer to the resulting automata as ST(aa#...ba) and ST(ba#...bb) respectively, where ST is short for sooner-time. A visual representation of these automata can be found in figure 4.

Define ST(1a#...b1) to be the product of the automaton ST(aa#...ba) with ST(ba#...bb). In principle, ST(1a#...b1) has 36 states; however, only 25 of these are accessible from the initial state. For example, according to Lemma 4.3 there is no string $x \in \mathcal{A}^*$ such that the path associated with $x$ in ST(1a#...b1) ends at state (aa, ba). A visual representation of ST(1a#...b1) reduced to only those states that are accessible from the initial state is displayed in the middle grid in figure 4, where for convenience we have relabeled the accessible states as 1, ..., 25.
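The count of accessible states can be reproduced with a breadth-first search over the product automaton. The sketch below hard-codes our reading of ST(aa#...ba) and ST(ba#...bb) as transition tables (the state names, table layout and function name are assumptions of this example) and collects every product state reached by a non-empty text.

```python
from collections import deque

# ST(aa#...ba): e1 -> a -> aa -(#)-> e2 -> b -> ba, with ba absorbing.
G1 = {"e1": {"a": "a", "b": "e1"}, "a": {"a": "aa", "b": "e1"},
      "aa": {"a": "e2", "b": "e2"}, "e2": {"a": "e2", "b": "b"},
      "b": {"a": "ba", "b": "b"}, "ba": {"a": "ba", "b": "ba"}}
# ST(ba#...bb): e1 -> b -> ba -(#)-> e2 -> b2 -> bb, with bb absorbing.
G2 = {"e1": {"a": "e1", "b": "b"}, "b": {"a": "ba", "b": "b"},
      "ba": {"a": "e2", "b": "e2"}, "e2": {"a": "e2", "b": "b2"},
      "b2": {"a": "e2", "b": "bb"}, "bb": {"a": "bb", "b": "bb"}}

def reachable_states(g1, g2, start=("e1", "e1")):
    """Product states reached from the initial state by reading at least one character."""
    seen, queue = set(), deque([start])
    while queue:
        u1, u2 = queue.popleft()
        for ch in "ab":
            target = (g1[u1][ch], g2[u2][ch])
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

accessible = reachable_states(G1, G2)
# A product state is terminal when either component has detected its pattern.
terminal = {s for s in accessible if s[0] == "ba" or s[1] == "bb"}
```

With these tables the search finds 25 accessible states, 9 of which are terminal, and the pair `("aa", "ba")` is not among them, in agreement with the text.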



According to Theorem 4.2 the pattern 1a#...b1 occurs in a text provided that the path associated with the text in ST(1a#...b1) ends at any of the states 13, 15, 16, 19, 20, 22, 23, 24 and 25. Furthermore, the sooner-time of 1a#...b1 corresponds to the first time that any of these states is visited.

To characterize the distribution of T, consider the first-order homogeneous Markov chain $X^{ST}$.
We denote the initial distribution and probability transition matrix of X respectively as μ and
P. Here μ is a row vector of dimension 25. The matrix P has dimensions 25 x 25 but is sparse
(in each row there are only two non-zero elements). By means of X the distribution of T can be
determined as shown in section 6.1 for the sooner-time of a pair of non-reduced patterns. Define
the vectors v and u and the matrix Q as follows.
The vector v corresponds to the vector μ with columns 13, 15, 16, 19, 20, 22, 23, 24 and 25 removed.
The matrix Q corresponds to the matrix P but with rows and columns 13, 15, 16, 19, 20, 22, 23, 24
and 25 removed. Finally, the vector u corresponds to the sum of columns 13, 15, 16, 19, 20, 22, 23,
24 and 25 in P, but with the rows of these same indices removed. Since states 13, 15, 16, 19,
20, 22, 23, 24 and 25 correspond to the detection of the pattern 1a#...b1, it follows that



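$$\mathrm{Prob}(T = n) = v \cdot Q^{n-2} \cdot u \qquad (n \ge 2).$$

The generating function associated with T is

$$\begin{aligned}
F(x) &:= \sum_{n \ge 2} \mathrm{Prob}(T = n)\, x^n \\
     &= x^2 \cdot v \cdot (I - x \cdot Q)^{-1} \cdot u \\
     &= \frac{pq\,x^5 \left(2p^3q^3x^5 + p^3q^3x^4 - 3p^3q^3x^3 + p^2q^2x^2 + pq(1 - pq)x + p^2 + q^2\right)}{(1 - px)(1 - qx)(1 - pqx^2)},
\end{aligned}$$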
where for the last identity we have used symbolic algebra software. 
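The step from the probabilities to the matrix form of F(x) is a geometric series in xQ. The following worked line is a sketch that assumes x is small enough in absolute value for the series to converge; the summation index k = n - 2 is introduced here only for the index shift.

```latex
\begin{aligned}
F(x) &= \sum_{n \ge 2} \bigl(v\, Q^{n-2}\, u\bigr)\, x^{n}
      = x^{2}\, v \Bigl(\sum_{k \ge 0} (xQ)^{k}\Bigr) u
      = x^{2}\, v\, (I - xQ)^{-1}\, u .
\end{aligned}
```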

This result for the generating function provides useful information about the distribution of T.
For instance, if $p \ne q$ then $x = \min\{1/p, 1/q\}$ is a simple zero of the denominator of F(x),
and it is the zero closest to the origin. On the other hand, since
$(p^2q^2x^2 - 3p^3q^3x^3) \ge 0$ for all $x \in [0, 1]$, the numerator
of F(x) does not vanish at $x = \min\{1/p, 1/q\}$. Hence
$\mathrm{Prob}[T = n] \sim c_1(p,q) \cdot (\min\{1/p, 1/q\})^{-n}$ as
$n \to \infty$, where $c_1(p,q) > 0$ is a computable constant that can be determined from the
partial fraction decomposition of F(x).

For the case p = q, we find that

$$F(x) = \frac{x^5 \left(2x^4 - 3x^3 + 3x^2 - 2x + 16\right)}{16\,(2 - x)^3}.$$

In this case, x = 2 is a zero (of order 3) of the denominator of F(x) but not of its numerator. Using
the partial fraction decomposition of F(x) and (11), it follows that $\mathrm{Prob}[T = n] \sim c_2 \cdot n^2/2^n$ as
$n \to \infty$, where $c_2$ is a computable constant from the partial fraction decomposition of F(x).
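For p = q = 1/2 the closed form can be compared against a brute-force count over all binary strings of a given length. The sketch below is an independent check under assumptions made here: the regular expression encodes the reading "keyword, at least one arbitrary character, keyword" of aa#...ba and ba#...bb, the names `sooner_time`, `brute_force_prob` and `series_prob` are hypothetical, and the series coefficients come from expanding 1/(2 - x)^3 with the binomial series.

```python
import re
from fractions import Fraction
from itertools import product
from math import comb

# Assumed reading of the patterns: aa, one or more characters, ba (or ba ... bb),
# with the second keyword ending exactly at the end of the prefix.
PATTERN = re.compile(r"(aa.+ba|ba.+bb)$")


def sooner_time(text):
    """Length of the shortest prefix of text that ends a pattern occurrence."""
    for n in range(1, len(text) + 1):
        if PATTERN.search(text[:n]):
            return n
    return None


def brute_force_prob(n):
    """Prob(T = n) for p = q = 1/2, by enumerating all strings of length n."""
    hits = sum(1 for s in product("ab", repeat=n) if sooner_time("".join(s)) == n)
    return Fraction(hits, 2 ** n)


# Numerator 2x^4 - 3x^3 + 3x^2 - 2x + 16, listed from the constant term up.
NUMERATOR = [16, -2, 3, -3, 2]


def series_prob(n):
    """Coefficient of x^n in x^5 * numerator / (16 * (2 - x)^3)."""
    total = Fraction(0)
    for j, c in enumerate(NUMERATOR):
        k = n - 5 - j
        if k >= 0:
            # Coefficient of x^k in 1/(2 - x)^3 is C(k + 2, 2) / 2^(k + 3).
            total += Fraction(c * comb(k + 2, 2), 2 ** (k + 3))
    return total / 16


for n in range(4, 9):
    print(n, brute_force_prob(n), series_prob(n))
```

The binomial coefficient C(k + 2, 2) grows quadratically in k, which is the source of the factor n^2 in the asymptotic estimate.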

Refinements of this argument can be used to find explicit formulae for the probabilities and
generating functions associated with the events $X_1 \ldots X_t$ = ...aa#...ba and $X_1 \ldots X_t$ = ...ba#...bb.



8 Conclusions 

In this paper we have reviewed the use of deterministic finite automata for probabilistic pattern 
matching. This view of the pattern matching problem allows many different problems to be addressed 
in a general framework, and unifies different ideas addressed in the computer science, mathematics, 
and bioinformatics literature. We have summarized the key results to present a self-contained 
mathematical summary of previous work, including definitions, theorems, proofs, and examples. 

The key results on deterministic automata concern how to construct state machines, possibly from
simpler state machines, that find matches of regular patterns in a given text. The Aho-Corasick
automaton (based on the maximum prefix-suffix rule) is a classic example of an automaton that
recognizes a set of keywords in a text. For matching compound patterns (i.e., patterns containing multiple
keywords), the synchronization of multiple automata is an important tool.

For assessing the statistical significance of motif searches in biological sequence data, the pattern 
matching problem must be extended to determine the probability that a pattern occurs in a random 
string (specified by a given model). Mathematically, this can be done by the Markov chain embedding
of a random text into an automaton. This means considering a random walk on the automaton 
where the transition probabilities of the walk are determined by the model which generates the 
random string. This maps the probabilistic pattern matching problem onto a Markov chain, allowing 
techniques from combinatorics and the theory of Markov chains to be applied to the problem. In 
particular, the probability that a given set of patterns occurs in the random text corresponds to the 
probability that a specific Markov chain visits a certain set of terminal states. 

To illustrate the application of these ideas to biological sequence analysis, we presented two
examples. In both examples, we used a simplified binary alphabet and patterns that admit a
simple description to illustrate the key ideas.

The first application was the search for a compound pattern consisting of two keywords. We 
demonstrated how to determine the transition matrix of the Markov chain which determines the 
probability that one of the keywords occurs in the random string. We then derived the sooner-time 
probability distribution, the probability that any of the keywords first occurs after n characters in 
the random string, and used generating function methods to derive the asymptotic distributions for 
large n. We used similar mathematical methods to derive the probability that either of the two 
keywords occurs a given number of times in a random string of n characters. 

The second application was the search for a correlated modular pattern, in which two sub-patterns
(modules) must appear in a certain order but can be separated by an arbitrary number of
characters. Correlations mean that certain characters within the pattern can take different values, 
but the values must be correlated (for example, through base pairing). We illustrated the calculation 
of the frequency statistic of a modular pattern, including the asymptotic probability distribution for 
large n. We also derived formulae for the probability that a correlated modular pattern first appears 
after n characters in the string. 

We have focused in this review on random strings produced by memoryless sources. However,
we have provided references for Markovian sources and hidden Markov models (which can also be 
handled in this framework). 

These methods are applicable to determining the significance of motif searches in genome
sequences. In particular, modular and correlated patterns frequently occur in the sequences of
functional RNA molecules. As the number of functional RNAs increases, the ability to infer the statistical
significance of matches to RNA sequence patterns is increasingly important. A unified mathematical
framework (based on the concepts of automata, Markov chain embedding, and synchronization)
can be used to analyze a range of biologically important pattern-matching problems, including the
regulation of splicing and transcription and the probability of occurrence of catalytic RNA motifs in 
genomes or random-sequence RNA pools. All these apparently different problems can be addressed 
in the framework of probabilistic pattern matching. 